Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/01/11 18:57:46 UTC

[GitHub] [airflow] jmullins commented on issue #12644: Network instabilities are able to freeze KubernetesJobWatcher

jmullins commented on issue #12644:
URL: https://github.com/apache/airflow/issues/12644#issuecomment-758154491


   We consistently experienced Kubernetes executor slot starvation, as described above, where worker pods get stuck in the Completed state and are never deleted because the KubernetesJobWatcher blocks indefinitely in its watch call:
   
   https://github.com/apache/airflow/blob/1.10.14/airflow/executors/kubernetes_executor.py#L315-L322
   
   The indefinite blocking is caused by the lack of both TCP keepalives and a default _request_timeout (socket timeout) in kube_client_request_args:
   https://github.com/apache/airflow/blob/2.0.0/airflow/config_templates/default_airflow.cfg#L990
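   
   As a minimal illustration of the failure mode (not Airflow code; the endpoint is a placeholder): a blocking read on a TCP socket with no timeout hangs forever when the peer silently vanishes, which is what a dropped overlay-network path looks like, while a socket timeout turns the hang into a recoverable error.
   
      import socket
      
      # Sketch of the underlying mechanism, with a placeholder endpoint.
      # Without a timeout, recv() blocks until the kernel sees a FIN/RST;
      # if the fault just drops packets, that never happens.
      sock = socket.create_connection(("apiserver.example", 443))
      
      sock.settimeout(600)  # what _request_timeout provides: a socket timeout
      try:
          data = sock.recv(4096)
      except socket.timeout:
          # The blocked read now fails after 600s instead of hanging, and
          # the caller (here, the KubernetesJobWatcher) can reconnect.
          pass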
   
   We were able to reproduce this behavior consistently by injecting network faults, or by stopping and starting the network link, on the node where the scheduler was running as part of an overlay network. Note that if the scheduler is not running on an overlay network, everything recovers on its own, since the TCP connection is recoverable.
   
   Setting a socket timeout (_request_timeout in kube_client_request_args) prevents executor slot starvation: the KubernetesJobWatcher recovers once the timeout is reached and properly cleans up worker pods stuck in the Completed state.
   
      kube_client_request_args = { "_request_timeout": 600 }
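   
   For context, this setting lives in the [kubernetes] section of airflow.cfg and, if I read the code right, is parsed as JSON and forwarded as keyword arguments to the Kubernetes API call behind the watch. A rough sketch of the effect, using the kubernetes Python client directly (the namespace here is illustrative):
   
      from kubernetes import client, config, watch
      
      config.load_incluster_config()  # or load_kube_config() outside the cluster
      v1 = client.CoreV1Api()
      
      w = watch.Watch()
      # _request_timeout is passed through to the underlying HTTP request
      # as a socket timeout: if no bytes arrive for 600s, the stream raises
      # instead of blocking forever, and the watcher can resync.
      for event in w.stream(v1.list_namespaced_pod,
                            namespace="airflow",
                            _request_timeout=600):
          print(event["type"], event["object"].metadata.name)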
   
   We currently set _request_timeout to 10 minutes, so we won't see a timeout unless there is a network fault: the Kubernetes watch itself expires before that (after 5 minutes).
   
   I think it makes sense to consider setting a default _request_timeout, even if the value is high, to protect against executor slot starvation and unavailability during network faults.
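   
   The other mitigation mentioned above, TCP keepalives, operates one layer down: the kernel probes an idle connection and errors out a blocked read once the peer stops answering. A socket-level sketch of what that would look like (Linux-specific option names; as far as I can tell the 1.10 kube client config does not expose these):
   
      import socket
      
      # Illustrative only: enable kernel keepalive probes so a dead peer
      # is detected even when the application never writes to the socket.
      sock = socket.create_connection(("apiserver.example", 443))
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
      # Linux-specific knobs: first probe after 60s idle, probe every 30s,
      # give up after 5 failed probes (~3.5 minutes to detect the fault).
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
      sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)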


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org