Posted to commits@airflow.apache.org by "Roland de Boo (JIRA)" <ji...@apache.org> on 2018/08/30 12:36:00 UTC

[jira] [Comment Edited] (AIRFLOW-2966) KubernetesExecutor + namespace quotas kills scheduler if the pod can't be launched

    [ https://issues.apache.org/jira/browse/AIRFLOW-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597385#comment-16597385 ] 

Roland de Boo edited comment on AIRFLOW-2966 at 8/30/18 12:35 PM:
------------------------------------------------------------------

Colleague of John here. Some additional info:
 * We updated to 1.10.0 and retried; the same issue remains.
 * Last line in the log (not mentioned above):

{{[2018-08-30 12:19:46,967] \{jobs.py:1585} INFO - Exited execute loop}}

In the Pod I can see two scheduler processes still running, but they don't seem to do anything.

{{$ ps -ef}}

{{airflow 16 1 0 12:19 ? 00:00:02 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1}}
{{airflow 38 16 0 12:19 ? 00:00:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1}}

The Pod is stuck but does not exit, so we have to kill it by hand.

If we then increase the quota on the namespace, the scheduler does not recover.

 

 


was (Author: rdeboo):
Colleague of John here. Some additional info:
 * Updated to 1.10.0 and retried, same issue remains
 * Last observation in the log (not mentioned above):

{{[2018-08-30 12:19:46,967] \{jobs.py:1585} INFO - Exited execute loop}}

In the Pod I can see 2 other threads remaining, but they don't seem to do anything.

{{$ ps -ef}}

{{airflow 16 1 0 12:19 ? 00:00:02 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1}}
{{airflow 38 16 0 12:19 ? 00:00:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1}}

The Pod is stuck but does not exit. So we need to kill it by hand.

 

 

> KubernetesExecutor + namespace quotas kills scheduler if the pod can't be launched
> ----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-2966
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2966
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10
>         Environment: Kubernetes 1.9.8
>            Reporter: John Hofman
>            Priority: Major
>
> This occurs when running Airflow in Kubernetes with the KubernetesExecutor and resource quotas set on the namespace Airflow is deployed in. If the scheduler tries to launch a pod that would exceed the namespace limits, it gets an ApiException, which crashes the scheduler.
> This stack trace is an example of the ApiException from the kubernetes client:
> {code:java}
> [2018-08-27 09:51:08,516] {pod_launcher.py:58} ERROR - Exception when attempting to create Namespaced Pod.
> Traceback (most recent call last):
> File "/src/apache-airflow/airflow/contrib/kubernetes/pod_launcher.py", line 55, in run_pod_async
> resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 6057, in create_namespaced_pod
> (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 6142, in create_namespaced_pod_with_http_info
> collection_formats=collection_formats)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 321, in call_api
> _return_http_data_only, collection_formats, _preload_content, _request_timeout)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
> _request_timeout=_request_timeout)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 364, in request
> body=body)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 266, in POST
> body=body)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 222, in request
> raise ApiException(http_resp=r)
> kubernetes.client.rest.ApiException: (403)
> Reason: Forbidden
> HTTP response headers: HTTPHeaderDict({'Audit-Id': 'b00e2cbb-bdb2-41f3-8090-824aee79448c', 'Content-Type': 'application/json', 'Date': 'Mon, 27 Aug 2018 09:51:08 GMT', 'Content-Length': '410'})
> HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"podname-ec366e89ef934d91b2d3ffe96234a725\" is forbidden: exceeded quota: compute-resources, requested: limits.memory=4Gi, used: limits.memory=6508Mi, limited: limits.memory=10Gi","reason":"Forbidden","details":{"name":"podname-ec366e89ef934d91b2d3ffe96234a725","kind":"pods"},"code":403}
> {code}
>  
> I would expect the scheduler to catch the Exception and at least mark the task as failed, or better yet retry the task later.
>  
>  
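
The behaviour requested in the description (catch the exception and retry the task later) could be sketched roughly as follows. This is an illustration only, not Airflow's actual code. The ApiException class below is a stand-in for kubernetes.client.rest.ApiException so that the snippet runs without the Kubernetes client installed; it assumes only the `status` and `body` attributes. The names `is_quota_error`, `launch_pending` and `launch_pod` are hypothetical.

```python
import json
from collections import deque


class ApiException(Exception):
    """Stand-in for kubernetes.client.rest.ApiException (assumption:
    only the `status` and `body` attributes are used here)."""

    def __init__(self, status, body):
        super().__init__("({})".format(status))
        self.status = status
        self.body = body


def is_quota_error(exc):
    """Return True if the API server rejected the pod because of a quota."""
    if exc.status != 403:
        return False
    try:
        payload = json.loads(exc.body)
    except (TypeError, ValueError):
        return False
    if not isinstance(payload, dict):
        return False
    return "exceeded quota" in payload.get("message", "")


def launch_pending(task_queue, launch_pod):
    """Try each queued task once; re-queue tasks rejected by a quota.

    `launch_pod` is a hypothetical callable that raises ApiException
    when the pod cannot be created. Returns the tasks that were launched.
    """
    launched = []
    for _ in range(len(task_queue)):
        task = task_queue.popleft()
        try:
            launch_pod(task)
        except ApiException as exc:
            if is_quota_error(exc):
                # Keep the task for a later scheduler loop instead of crashing.
                task_queue.append(task)
                continue
            raise  # other API errors are out of scope for this sketch
        launched.append(task)
    return launched
```

With this shape, a quota rejection leaves the task in the queue and the loop carries on with the remaining tasks. Whether to retry indefinitely or mark the task as failed after some number of attempts is a policy decision that the sketch leaves open.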



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)