Posted to commits@airflow.apache.org by "Max (Jira)" <ji...@apache.org> on 2019/11/22 19:16:00 UTC

[jira] [Comment Edited] (AIRFLOW-6040) Airflow scheduler with kubernetes executor fails :- Unknown error in KubernetesJobWatcher

    [ https://issues.apache.org/jira/browse/AIRFLOW-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980186#comment-16980186 ] 

Max edited comment on AIRFLOW-6040 at 11/22/19 7:15 PM:
--------------------------------------------------------

We ran into this same issue. I believe this is actually an issue in the upstream [kubernetes|https://github.com/kubernetes-client/python] package and not in Airflow.

The exception is thrown from [this loop|https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/executors/kubernetes_executor.py#L356]. The loop passes {{label_selector="airflow-worker=<uuid>"}} to the {{list_namespaced_pod()}} method. When that call is used in a {{Watch()}}, it returns nothing if no Pods match the given UUID. The {{_request_timeout}} [config setting|https://github.com/apache/airflow/blob/1.10.6/airflow/config_templates/default_airflow.cfg#L828] then causes the underlying {{urllib3}} library to throw a timeout exception, which {{Watch()}} does not handle.

You can easily reproduce this by running a simple Python pod (in your Airflow namespace so it has the same ServiceAccount permissions) and executing the following snippet:
{code:bash}
$ kubectl -n <your-namespace> run -i -t python --image=python:3.7.4-slim-stretch --restart=Never --command -- /bin/sh
# pip install kubernetes
# python
>>> from kubernetes import config, client, watch
>>> from kubernetes.client.rest import ApiException
>>> config.load_incluster_config()
>>> k8s = client.CoreV1Api()
>>> watcher = watch.Watch()
>>> namespace = "<your-namespace>"
>>> for event in watcher.stream(k8s.list_namespaced_pod, namespace, resource_version="0", label_selector="airflow-worker=dont-find-this", _request_timeout=(60, 60)):
...     print(event['object'])
{code}
I've observed this behavior in both Airflow 1.10.5 & 1.10.6, Python 2.7 & Python 3.7, K8s 1.15 & K8s 1.16, urllib3 1.24 & urllib3 1.25.

Setting {{timeout_seconds=50}} in the [Watch() loop|https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/executors/kubernetes_executor.py#L356] will cause a warning instead of an exception. {{timeout_seconds}} targets the [list_namespaced_pod|https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#list_namespaced_pod] method as opposed to the underlying {{urllib3}} library.
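
For anyone who would rather handle the timeout in their own code, here is a minimal sketch of a restart loop around an idle watch. It is not what Airflow does. The names {{consume_watch}}, {{open_stream}}, {{handle_event}}, {{retry_on}}, {{max_restarts}} and {{backoff_seconds}} are made up for illustration, and the helper takes the stream as a callable so it does not depend on the kubernetes package itself.

```python
import time


def consume_watch(open_stream, handle_event, retry_on,
                  max_restarts=3, backoff_seconds=0.0):
    """Consume events from a restartable stream (hypothetical helper).

    open_stream  -- zero-argument callable returning a fresh iterator of events
    handle_event -- callable invoked once per event
    retry_on     -- exception class (or tuple of classes) treated as an idle timeout
    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            for event in open_stream():
                handle_event(event)
            # The iterator ended without raising, so stop here.
            return restarts
        except retry_on:
            restarts += 1
            if restarts > max_restarts:
                # Give up and surface the timeout to the caller.
                raise
            time.sleep(backoff_seconds)
```

Inside the pod from the snippet above, {{open_stream}} could be {{lambda: watcher.stream(k8s.list_namespaced_pod, namespace, label_selector="airflow-worker=dont-find-this", timeout_seconds=50, _request_timeout=(60, 60))}} and {{retry_on}} could be {{urllib3.exceptions.ReadTimeoutError}}. With {{timeout_seconds}} below the read timeout, I would expect the stream to end on its own before urllib3 raises, but I have only verified the warning behaviour described above.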

Hope this helps others who are facing this issue.


was (Author: fl-max):
We ran into this same issue. I believe this is actually an issue in the upstream [kubernetes|https://github.com/kubernetes-client/python] package and not in Airflow.

The exception is thrown from [this loop|https://github.com/apache/airflow/blob/1.10.6/airflow/contrib/executors/kubernetes_executor.py#L356]. The loop passes {{label_selector="airflow-worker=<uuid>"}} to the {{list_namespaced_pod()}} method. When that call is used in a {{Watch()}}, it returns nothing if no Pods match the given UUID. The {{_request_timeout}} [config setting|https://github.com/apache/airflow/blob/1.10.6/airflow/config_templates/default_airflow.cfg#L828] then causes the underlying {{urllib3}} library to throw a timeout exception, which {{Watch()}} does not handle.

You can easily reproduce this by running a simple Python pod (in your Airflow namespace so it has the same ServiceAccount permissions) and executing the following snippet:
{code:bash}
$ kubectl -n <your-namespace> run -i -t python --image=python:3.7.4-slim-stretch --restart=Never --command -- /bin/sh
# pip install kubernetes
# python
>>> from kubernetes import config, client, watch
>>> from kubernetes.client.rest import ApiException
>>> config.load_incluster_config()
>>> k8s = client.CoreV1Api()
>>> watcher = watch.Watch()
>>> namespace = "<your-namespace>"
>>> for event in watcher.stream(k8s.list_namespaced_pod, namespace, resource_version="0", label_selector="airflow-worker=dont-find-this", _request_timeout=(60, 60)):
...     print(event['object'])
{code}
I've observed this behavior in both Airflow 1.10.5 & 1.10.6, Python 2.7 & Python 3.7, K8s 1.15 & K8s 1.16, urllib3 1.24 & urllib3 1.25.

As a workaround, setting [kube_client_request_args|https://github.com/apache/airflow/blob/1.10.6/airflow/config_templates/default_airflow.cfg#L828] to:
{noformat}
"{ \"_request_timeout\" : [60,60], \"timeout_seconds\" : 50 }"
{noformat}
will cause a warning instead of an exception. {{timeout_seconds}} targets the [list_namespaced_pod|https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#list_namespaced_pod] method as opposed to the underlying urllib3 library.
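
To show what that setting amounts to, here is a rough sketch that assumes the option is read as a JSON string and its keys are forwarded as keyword arguments to the watch call. {{parse_request_args}} is a hypothetical helper, not Airflow's actual code, and the string below is the unescaped form of the value above.

```python
import json


def parse_request_args(raw):
    """Turn a JSON option string into keyword arguments (hypothetical helper)."""
    args = json.loads(raw) if raw else {}
    timeout = args.get("_request_timeout")
    # JSON has no tuple type, so a [connect, read] list is converted to a
    # tuple, the form used in the reproduction snippet: _request_timeout=(60, 60).
    if isinstance(timeout, list):
        args["_request_timeout"] = tuple(timeout)
    return args


raw = '{ "_request_timeout" : [60,60], "timeout_seconds" : 50 }'
kwargs = parse_request_args(raw)
print(kwargs)
```

Under that assumption, both keys end up on the same {{list_namespaced_pod}} call: {{_request_timeout}} is consumed by the HTTP layer, while {{timeout_seconds}} is sent to the API as part of the list request.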

Hope this helps others who are facing this issue.

> Airflow scheduler with kubernetes executor fails :- Unknown error in KubernetesJobWatcher
> -----------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-6040
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6040
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: contrib, executor-kubernetes, scheduler
>    Affects Versions: 1.10.6
>            Reporter: Ashutosh Srivastava
>            Assignee: Daniel Imberman
>            Priority: Major
>
> I am trying to set up airflow with the kubernetes executor. I have cloned airflow 1.10.6 and am building the docker image and then deploying it with kube. The pods are running, the service airflow also starts. The webserver is working fine. But when I check the logs for the scheduler I get the following error.
>  
> {{ERROR - Error while health checking kube watcher process. Process died for unknown reasons
> INFO - Event: and now my watch begins starting at resource_version: 0
> ERROR - Unknown error in KubernetesJobWatcher. Failing
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 333, in run
>     self.worker_uuid, self.kube_config)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 358, in _run
>     **kwargs):
>   File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 144, in stream
>     for line in iter_resp_lines(resp):
>   File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
>     for seg in resp.read_chunked(decode_content=False):
>   File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 781, in read_chunked
>     self._original_response.close()
>   File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
>     self.gen.throw(type, value, traceback)
>   File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 439, in _error_catcher
>     raise ReadTimeoutError(self._pool, None, "Read timed out.")
> ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out.}}


