Posted to commits@airflow.apache.org by "Roland de Boo (Jira)" <ji...@apache.org> on 2019/08/22 07:49:00 UTC

[jira] [Updated] (AIRFLOW-5282) KubernetesExecutor hangs when kubernetes API times out

     [ https://issues.apache.org/jira/browse/AIRFLOW-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roland de Boo updated AIRFLOW-5282:
-----------------------------------
    Description: 
While migrating to a different Kubernetes cluster, we observe that the scheduler hangs very frequently. No output is generated in the logs. The UI states:

{{The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago.}}

I've attached py-spy to the scheduler process to investigate. This is the output:

 
{code:java}
  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
 100.00% 100.00%   38.00s    38.00s   recv_into (OpenSSL/SSL.py:1821)
  0.00% 100.00%   0.000s    38.00s   sync (airflow/contrib/executors/kubernetes_executor.py:795)
  0.00% 100.00%   0.000s    38.00s   request_encode_body (urllib3/request.py:150)
  0.00% 100.00%   0.000s    38.00s   begin (http/client.py:307)
  0.00% 100.00%   0.000s    38.00s   recv_into (urllib3/contrib/pyopenssl.py:304)
  0.00% 100.00%   0.000s    38.00s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    38.00s   create_namespaced_pod_with_http_info (kubernetes/client/apis/core_v1_api.py:6206)
  0.00% 100.00%   0.000s    38.00s   _make_request (urllib3/connectionpool.py:383)
  0.00% 100.00%   0.000s    38.00s   _execute_helper (airflow/jobs/scheduler_job.py:1412){code}
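
For reference, output in this format comes from py-spy's sampling "top" view attached to the scheduler process. A minimal way to reproduce it (the pod name is taken from the logs below; the PID and the presence of py-spy inside the image are assumptions):
{code:java}
# Attach py-spy to the running scheduler inside its pod (py-spy must be installed there).
# The scheduler PID (assumed to be 1 here) depends on how the image starts Airflow.
kubectl exec -it airflow-scheduler-7c5d6df8f4-w4tpn -- py-spy top --pid 1

# Or take a one-off stack dump instead of the live sampling view:
kubectl exec -it airflow-scheduler-7c5d6df8f4-w4tpn -- py-spy dump --pid 1
{code}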

 
Next, I enabled debug logging for urllib3 in `cli.py`. This confirmed that the scheduler hangs at the point where no response comes back from the Kubernetes API:
{code:java}

(base) macbook:~ roland$ kubectl logs airflow-scheduler-7c5d6df8f4-w4tpn > logs.txt
(base) macbook:~ roland$ cat logs.txt | grep pod_launcher.py
[2019-08-20 14:40:10,471] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:40:10,652] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:43:03,213] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:43:03,296] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:38,078] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:38,409] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:39,804] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:39,867] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:45:21,785] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:45:22,040] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:50:22,224] {pod_launcher.py:60} DEBUG - Pod Creation Request:
{code}
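
(A minimal sketch of enabling that urllib3 debug output with the standard logging module; where exactly it goes in `cli.py` is an implementation detail:)
{code:python}
import logging

# Route urllib3's connection/request debug messages to the root logger so that
# a Kubernetes API call that never returns becomes visible in the scheduler logs.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
{code}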
It appears that a timeout can already be specified in `airflow.cfg`, but the default is no timeout at all, so a call can block forever. In the `[kubernetes]` section: `kube_client_request_args = {"_request_timeout" : [60,60] }`
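
For example, the relevant block in airflow.cfg would look roughly like this (the two values are the connect and read timeouts in seconds; 60/60 is just the value from the example above, not a recommendation):
{code:java}
[kubernetes]
# Extra kwargs passed to every Kubernetes API call made by the client.
# [connect_timeout, read_timeout] in seconds; without this, a call can block forever.
kube_client_request_args = {"_request_timeout" : [60, 60]}
{code}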

After adding the timeout, the scheduler no longer hangs; instead it exits with an error (and is restarted by Kubernetes).

I propose a PR which:
 * Adds a default _request_timeout to airflow.cfg, so that default installations do not hang indefinitely
 * Catches these errors in the scheduler and retries scheduling the task, in a similar fashion to how ApiExceptions are caught (https://github.com/apache/airflow/blob/46885ccdbc8618f295b8f986da95c1abbb85bb02/airflow/executors/kubernetes_executor.py#L800-L803); a sketch follows below.
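
A rough sketch of what the second point could look like inside KubernetesExecutor.sync(); the ApiException branch mirrors the linked code, while the HTTPError branch is the proposed, purely illustrative addition:
{code:python}
from kubernetes.client.rest import ApiException
from urllib3.exceptions import HTTPError

# Inside KubernetesExecutor.sync(), when submitting the next queued task:
try:
    self.kube_scheduler.run_next(task)
except ApiException as e:
    # Existing behaviour: log and re-queue so the task is retried on the next sync.
    self.log.warning('ApiException when attempting to run task, re-queueing. Message: %s', e.reason)
    self.task_queue.put(task)
except HTTPError as e:
    # Proposed: treat a timed-out or otherwise failed API call (e.g. ReadTimeoutError)
    # the same way, instead of letting the exception take down the scheduler.
    self.log.warning('HTTPError when attempting to run task, re-queueing. Error: %s', str(e))
    self.task_queue.put(task)
{code}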

 

 


> KubernetesExecutor hangs when kubernetes API times out
> ------------------------------------------------------
>
>                 Key: AIRFLOW-5282
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5282
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.4
>            Reporter: Roland de Boo
>            Assignee: Roland de Boo
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)