Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/10/13 12:33:16 UTC

[GitHub] [airflow] AutomationDev85 opened a new issue, #27032: Worker sometimes does not reconnect to redis/celery queue after crash

AutomationDev85 opened a new issue, #27032:
URL: https://github.com/apache/airflow/issues/27032

   ### Apache Airflow version
   
   2.4.1
   
   ### What happened
   
   We are running an Airflow deployment and had the issue that the Redis pod died and then some tasks got stuck in the queued state. Only after killing the worker pod were the tasks consumed by the worker again. I wanted to analyse this in more detail and saw that this behavior only occurs sometimes!
   
   To me it looks like the worker sometimes does not detect that the connection to the Redis pod broke:
   1) If I do not see any error in the log file, the worker does NOT reconnect once Redis is back!
   2) If I see this error in the worker log, it is WORKING and the worker reconnects automatically:
   [2022-10-13 06:29:55,967: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 332, in start
       blueprint.start(self)
     File "/home/airflow/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
       step.start(parent)
     File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 628, in start
       c.loop(*c.loop_args())
     File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/loops.py", line 97, in asynloop
       next(loop)
     File "/home/airflow/.local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
       cb(*cbargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 1326, in on_readable
       self.cycle.on_readable(fileno)
     File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 562, in on_readable
       chan.handlers[type]()
     File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 906, in _receive
       ret.append(self._receive_one(c))
     File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 916, in _receive_one
       response = c.parse_response()
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3505, in parse_response
       response = self._execute(conn, conn.read_response)
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3479, in _execute
       return command(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 739, in read_response
       response = self._parser.read_response()
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 324, in read_response
       raw = self._buffer.readline()
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 256, in readline
       self._read_from_socket()
     File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 201, in _read_from_socket
       raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
   redis.exceptions.ConnectionError: Connection closed by server.
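
One way to tell the two cases apart after a Redis restart is to scan the worker log for the warning quoted above. The following is a minimal sketch, assuming the log is available as lines of text; the function and constant names are hypothetical and not part of Airflow or Celery. A missing message only suggests that the worker did not notice the disconnect; it does not prove it.

```python
from typing import Iterable

# Message text copied from the worker log quoted in this report.
BROKER_LOST_MARKER = "consumer: Connection to broker lost"


def saw_broker_disconnect(log_lines: Iterable[str]) -> bool:
    """Return True if any log line contains the broker-lost warning."""
    return any(BROKER_LOST_MARKER in line for line in log_lines)
```

If this returns False for the time window in which Redis was killed, the worker is likely in case 1 and may need a restart.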
   
   ### What you think should happen instead
   
   Expected behavior is that the worker reconnects to Redis automatically and starts consuming queued tasks.
   
   ### How to reproduce
   
   1) Create a DAG with 2 tasks that run one after the other.
   2) Start the DAG and, while the first task is executing, force kill the Redis pod (kubectl delete pod redis-0 -n ??? --grace-period=0 --force) to simulate a crashing pod.
   3) Check whether the worker reconnects automatically and executes the next task, or whether the task gets stuck in the queued state and the worker must be killed to fix this.
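
Because the problem is intermittent, step 2 may need to be repeated many times. The sketch below wraps the kubectl command from step 2 so it can be called from a test script. It assumes kubectl is on the PATH and already configured for the cluster; the function names are hypothetical, and the namespace, which is not given in this report, stays a parameter.

```python
import subprocess
from typing import List


def build_force_delete_command(pod: str, namespace: str) -> List[str]:
    """Return the kubectl argv from step 2 for the given pod and namespace."""
    return [
        "kubectl", "delete", "pod", pod,
        "-n", namespace,
        "--grace-period=0",
        "--force",
    ]


def force_delete_pod(pod: str, namespace: str) -> int:
    """Run the command and return kubectl's exit code (needs cluster access)."""
    command = build_force_delete_command(pod, namespace)
    return subprocess.run(command, check=False).returncode
```

Calling `force_delete_pod("redis-0", namespace)` while the first task is running corresponds to step 2; step 3 still has to be checked in the Airflow UI or the worker log.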
   
   ### Operating System
   
    AKSUbuntu-1804gen2
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Using an AKS cluster in Azure to host Airflow.
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] jens-scheffler-bosch commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
jens-scheffler-bosch commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1288431435

   > > As I see that the new Helm Chart 1.7.0 was released, does somebody know or expect if the new liveness probe on Celery Worker will fix this problem implicitly? (#25561)
   > 
   > I think the only way to check is to try it (maybe you can try it @jens-scheffler-bosch - that would help us to close the issue).
   
   We are on it and have just deployed via Helm chart 1.7.0. Because the problem only appeared randomly, it is hard to predict whether it is resolved. I'd be okay with closing on the (positive) assumption that it is fixed; we may come back if we see the problem again.




[GitHub] [airflow] potiuk commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1291388879

   Maybe just keep it running for a while and let us know after a few days of running (depending on the previously observed frequency). If you do not see it after 2x the "average" observation time, we might assume it works :)




[GitHub] [airflow] o-nikolas commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
o-nikolas commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1278142197

   Thanks for filing this issue @AutomationDev85! I see that you're willing to submit a PR, let me know if you would like this issue assigned to you :smile: 
   
   As a follow-up: are you sure this is a bug in Airflow or is this actually a Celery issue?




[GitHub] [airflow] potiuk commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1306345997

   Closing. 2 weeks have passed. @jens-scheffler-bosch - if you run into any issue, you can still comment here.




[GitHub] [airflow] boring-cyborg[bot] commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1277537973

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] jens-scheffler-bosch commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
jens-scheffler-bosch commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1279499452

   As I see that the new Helm Chart 1.7.0 was released, does somebody know or expect if the new liveness probe on Celery Worker will fix this problem implicitly? (https://github.com/apache/airflow/pull/25561)
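
To make the idea concrete: a liveness probe runs a check command on a schedule, and Kubernetes restarts the container when the check keeps failing, which would automate the manual worker kill described in the issue. The sketch below shows only that general pattern, not the probe the Helm chart actually ships; the function name is hypothetical, and the check command is a stand-in that always succeeds.

```python
import subprocess
import sys
from typing import Sequence


def probe_exit_code(command: Sequence[str], timeout_seconds: float) -> int:
    """Run a health-check command; return 0 if it succeeds in time, else 1."""
    try:
        completed = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # A check that hangs (for example on a dead connection) counts as failed.
        return 1
    return 0 if completed.returncode == 0 else 1


# Stand-in check: a Python process that exits immediately with success.
print(probe_exit_code([sys.executable, "-c", "pass"], timeout_seconds=5.0))
```

Whether such a probe catches the silent case depends on the check actually exercising the broker connection, which is why trying it on a real deployment is the only reliable test.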




[GitHub] [airflow] potiuk commented on issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27032:
URL: https://github.com/apache/airflow/issues/27032#issuecomment-1288222139

   > As I see that the new Helm Chart 1.7.0 was released, does somebody know or expect if the new liveness probe on Celery Worker will fix this problem implicitly? (#25561)
   
   I think the only way to check is to try it.




[GitHub] [airflow] potiuk closed issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #27032: Worker sometimes does not reconnect to redis/celery queue after crash 
URL: https://github.com/apache/airflow/issues/27032

