
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/11 22:08:59 UTC

[GitHub] [airflow] jmullins edited a comment on issue #12644: Network instabilities are able to freeze KubernetesJobWatcher

jmullins edited a comment on issue #12644:
URL: https://github.com/apache/airflow/issues/12644#issuecomment-758154491


   We consistently experienced Kubernetes executor slot starvation, as described above: worker pods get stuck in a completed state and are never deleted because the KubernetesJobWatcher watch blocks indefinitely:
   
   https://github.com/apache/airflow/blob/1.10.14/airflow/executors/kubernetes_executor.py#L315-L322
   
   The watch blocks indefinitely because there are no TCP keepalives and no default _request_timeout (socket timeout) in kube_client_request_args:
   https://github.com/apache/airflow/blob/2.0.0/airflow/config_templates/default_airflow.cfg#L990
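   For readers unfamiliar with the keepalive half of that statement, here is a minimal sketch of what enabling TCP keepalives on a plain socket looks like. It only illustrates the mechanism; it is not how Airflow or the Kubernetes Python client configures its connections. The helper name enable_keepalive and the idle/interval/probes values are assumptions made for illustration, and the tuning options are platform-dependent, so they are applied only where the socket module exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=3):
    """Hypothetical helper: turn on TCP keepalive probes for a socket.

    The idle/interval/probes values are illustrative assumptions,
    not values recommended by Airflow or Kubernetes.
    """
    # Ask the kernel to probe an idle connection so that a dead peer
    # is eventually detected instead of blocking a read forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The tuning knobs are not available on every platform, so set
    # only the ones this Python build exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

   With probes enabled, the kernel eventually reports a dead connection as an error on the blocked read, which gives the caller a chance to reconnect.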
   
   We were able to reproduce this behavior consistently by injecting network faults or by clearing the conntrack state on the node where the scheduler was running as part of an overlay network.
   
   Setting a socket timeout (_request_timeout in kube_client_request_args) prevents executor slot starvation: the KubernetesJobWatcher recovers once the timeout is reached and properly cleans up worker pods stuck in the completed state.
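   The recovery behavior can be illustrated without a cluster. The sketch below uses only the Python standard library: a local listener that never accepts or replies stands in for a peer whose traffic is silently dropped, and a hypothetical watch loop (watch_with_timeout is an illustrative name, not Airflow code) reconnects each time a read times out. The 0.2 second timeout is an assumption chosen to keep the example fast.

```python
import socket


def watch_with_timeout(address, timeout_seconds, max_attempts):
    """Hypothetical watch loop: reconnect whenever a read times out."""
    timeouts = 0
    for _ in range(max_attempts):
        # create_connection applies the timeout to the connect call and
        # to every later blocking operation on the returned socket.
        with socket.create_connection(address, timeout=timeout_seconds) as conn:
            try:
                conn.recv(1024)  # with timeout=None this would block forever
            except socket.timeout:
                # Recover: drop the dead connection and start a new one.
                timeouts += 1
    return timeouts


# A listener that completes TCP handshakes through its backlog but never
# accepts or sends anything, so every client read goes unanswered.
silent_peer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent_peer.bind(("127.0.0.1", 0))
silent_peer.listen(5)

recovered = watch_with_timeout(
    silent_peer.getsockname(), timeout_seconds=0.2, max_attempts=3
)
silent_peer.close()
```

   Every attempt ends in a timeout rather than a hang, so the loop keeps control and can reconnect, which is the property the watcher needs.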
   
   `
   kube_client_request_args = { "_request_timeout": 600 }
   `
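   The value above is JSON. As far as I understand, the Kubernetes Python client accepts _request_timeout either as a single number (total request timeout) or as a (connection, read) pair. The sketch below shows one way such a setting could be turned into keyword arguments; parse_request_args is a hypothetical helper written for illustration, not Airflow's own parsing code, and the [60, 600] pair is an assumed example value.

```python
import json


def parse_request_args(raw):
    """Hypothetical helper: turn the JSON config value into kwargs."""
    args = json.loads(raw) if raw.strip() else {}
    timeout = args.get("_request_timeout")
    # JSON has no tuple type, so a (connection, read) pair arrives as a
    # list and is converted to a tuple here.
    if isinstance(timeout, list):
        args["_request_timeout"] = tuple(timeout)
    return args


# Single number: one total timeout for the request, in seconds.
single = parse_request_args('{ "_request_timeout": 600 }')

# Pair: separate connection and read timeouts (assumed example values).
pair = parse_request_args('{ "_request_timeout": [60, 600] }')
```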
   
   We currently set _request_timeout to 10 minutes, so we won't see a timeout unless there's a network fault, since the Kubernetes watch itself expires before this (after 5 min).
   
   I think it makes sense to consider setting a default _request_timeout, even if the value is high, to protect against executor slot starvation and unavailability during network faults.

