Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/12/08 14:32:29 UTC

[GitHub] [airflow] atrbgithub edited a comment on issue #12644: Network instabilities are able to freeze KubernetesJobWatcher

atrbgithub edited a comment on issue #12644:
URL: https://github.com/apache/airflow/issues/12644#issuecomment-740651751


   @guillemborrell We're also seeing this with Kubernetes. We seem to hit:
   
   ```
   [2020-11-19 06:02:00,227] {kubernetes_executor.py:837} WARNING - HTTPError when attempting to run task, re-queueing. Exception: HTTPSConnectionPool(host='some_ip', port=443): Max retries exceeded with url: /api/v1/namespaces/some-namespace/pods (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcc36a8eee0>: Failed to establish a new connection: [Errno 111] Connection refused'))
   ```
   
   We're on version `1.10.12`. As you mention, pods then remain in the namespace with a status of `Completed` and are not cleaned up. No more jobs appear to be submitted to k8s once this is hit.
   
   We're seeing this regularly when Kubernetes undergoes maintenance on the master node.
   
   There is a similar issue open against the Python k8s client, which I believe Airflow is using:
   https://github.com/kubernetes-client/python/issues/1148
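   
   For anyone hitting this, one mitigation might be to put a hard timeout on the watch so that a silently dropped connection forces a reconnect instead of hanging forever. Below is a minimal sketch using the `kubernetes` Python client; the namespace and loop structure are illustrative only, not Airflow's actual watcher code:
   
   ```
   from kubernetes import client, config, watch
   
   config.load_incluster_config()  # or load_kube_config() outside the cluster
   v1 = client.CoreV1Api()
   
   NAMESPACE = "some-namespace"  # placeholder
   resource_version = None
   
   while True:
       w = watch.Watch()
       # timeout_seconds makes the API server close the stream after 60s,
       # so this loop reconnects even if events silently stop arriving.
       kwargs = {"timeout_seconds": 60}
       if resource_version:
           kwargs["resource_version"] = resource_version
       for event in w.stream(v1.list_namespaced_pod, NAMESPACE, **kwargs):
           resource_version = event["object"].metadata.resource_version
           # ... handle the pod event here ...
       # (handling of 410 Gone / stale resource_version omitted for brevity)
   ```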
   
   Separately, we're seeing issues with the pod operator when using the Kubernetes executor. The parent pod appears to stop seeing events from the child pod it created, so it never sees that the child pod completes, leaving the parent pod, and hence the job, hanging forever.
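   
   For reference, a polling fallback for that case might look something like this: instead of trusting the event stream alone, poll the child pod's phase directly so a stalled watch can't leave the parent waiting forever. This is a rough sketch using the `kubernetes` Python client; `wait_for_pod` and its parameters are hypothetical, not anything in Airflow itself:
   
   ```
   import time
   from kubernetes import client, config
   
   config.load_incluster_config()
   v1 = client.CoreV1Api()
   
   def wait_for_pod(name, namespace, poll_seconds=30):
       """Poll the API server until the pod reaches a terminal phase."""
       while True:
           pod = v1.read_namespaced_pod(name=name, namespace=namespace)
           if pod.status.phase in ("Succeeded", "Failed"):
               return pod.status.phase
           time.sleep(poll_seconds)
   ```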
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org