Posted to commits@airflow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/09/27 09:21:00 UTC

[jira] [Commented] (AIRFLOW-2966) KubernetesExecutor + namespace quotas kills scheduler if the pod can't be launched

    [ https://issues.apache.org/jira/browse/AIRFLOW-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630037#comment-16630037 ] 

ASF GitHub Bot commented on AIRFLOW-2966:
-----------------------------------------

johnhofman opened a new pull request #3960: [AIRFLOW-2966] Catch ApiException in the Kubernetes Executor
URL: https://github.com/apache/incubator-airflow/pull/3960
 
 
   ### Description
   
   Creating a pod that exceeds a namespace's resource quota throws an ApiException. This change catches the exception and re-queues the task inside the Executor instead of killing the scheduler.
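
The catch-and-re-queue behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in the pull request: `ApiException` here is a local stand-in for `kubernetes.client.rest.ApiException` so the sketch runs without the Kubernetes client, and `QuotaLimitedLauncher` and `sync` are hypothetical names invented for the example.

```python
import queue


class ApiException(Exception):
    """Local stand-in for kubernetes.client.rest.ApiException (assumption)."""

    def __init__(self, status, reason):
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status
        self.reason = reason


class QuotaLimitedLauncher:
    """Hypothetical launcher that rejects pods once the namespace quota is used up."""

    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.launched = []

    def run_next(self, task):
        if self.free_slots <= 0:
            # Mirrors the 403 Forbidden response seen when the quota is exceeded.
            raise ApiException(403, "Forbidden")
        self.free_slots -= 1
        self.launched.append(task)


def sync(task_queue, launcher):
    """Try each task currently queued once; put rejected tasks back on the queue."""
    for _ in range(task_queue.qsize()):
        task = task_queue.get_nowait()
        try:
            launcher.run_next(task)
        except ApiException as exc:
            # Without this handler the exception would propagate and stop the caller.
            print(f"ApiException when launching {task}: {exc}; re-queuing")
            task_queue.put(task)


tasks = queue.Queue()
for name in ("task_a", "task_b", "task_c"):
    tasks.put(name)

launcher = QuotaLimitedLauncher(free_slots=2)
sync(tasks, launcher)
```

In this sketch the third task stays on the queue and would be attempted again on the next call to `sync`, once quota is available.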
   
   `click 7.0` was recently released but `flask-appbuilder 1.11.1 has requirement click==6.7`. I have pinned `click==6.7` to make the dependencies resolve.
   
   ### Tests
   
   This adds a single test, `TestKubernetesExecutor.test_run_next_exception`, that covers this scenario. Without the changes, this test fails because the ApiException is not caught.
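
A test for this scenario might look something like the sketch below. It is not the test added by the pull request: `run_pending` is a hypothetical stand-in for the executor step under test, `ApiException` is a local substitute for the Kubernetes client exception, and the launcher is a `unittest.mock.MagicMock` whose `run_next` is made to raise.

```python
import queue
import unittest
from unittest import mock


class ApiException(Exception):
    """Local stand-in for kubernetes.client.rest.ApiException (assumption)."""


def run_pending(task_queue, launcher):
    """Hypothetical executor step: launch queued tasks, re-queue on ApiException."""
    for _ in range(task_queue.qsize()):
        task = task_queue.get_nowait()
        try:
            launcher.run_next(task)
        except ApiException:
            task_queue.put(task)


class TestRunNextException(unittest.TestCase):
    def test_run_next_exception(self):
        launcher = mock.MagicMock()
        # Simulate the API rejecting the pod because the quota is exceeded.
        launcher.run_next.side_effect = ApiException("(403) Forbidden")
        tasks = queue.Queue()
        tasks.put("task_a")

        # Must not raise; the task should be back on the queue afterwards.
        run_pending(tasks, launcher)
        launcher.run_next.assert_called_once_with("task_a")
        self.assertEqual(tasks.qsize(), 1)

        # Once the launcher stops raising, the re-queued task is consumed.
        launcher.run_next.side_effect = None
        run_pending(tasks, launcher)
        self.assertTrue(tasks.empty())
```

Removing the `except ApiException` branch from `run_pending` makes the first call raise, which is the failure mode the test is meant to guard against.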
   
   This is the first test case for the `KubernetesExecutor`, so I needed to add the `[kubernetes]` section to `default_test.cfg` so that the `KubernetesExecutor` can be built without exceptions.
   



> KubernetesExecutor + namespace quotas kills scheduler if the pod can't be launched
> ----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-2966
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2966
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.0.0
>         Environment: Kubernetes 1.9.8
>            Reporter: John Hofman
>            Priority: Major
>
> This occurs when running Airflow in Kubernetes with the KubernetesExecutor and resource quotas set on the namespace Airflow is deployed in. If the scheduler tries to launch a pod that exceeds the namespace limits, it gets an ApiException, which crashes the scheduler.
> This stack trace is an example of the ApiException from the kubernetes client:
> {code:java}
> [2018-08-27 09:51:08,516] {pod_launcher.py:58} ERROR - Exception when attempting to create Namespaced Pod.
> Traceback (most recent call last):
> File "/src/apache-airflow/airflow/contrib/kubernetes/pod_launcher.py", line 55, in run_pod_async
> resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 6057, in create_namespaced_pod
> (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 6142, in create_namespaced_pod_with_http_info
> collection_formats=collection_formats)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 321, in call_api
> _return_http_data_only, collection_formats, _preload_content, _request_timeout)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
> _request_timeout=_request_timeout)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 364, in request
> body=body)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 266, in POST
> body=body)
> File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 222, in request
> raise ApiException(http_resp=r)
> kubernetes.client.rest.ApiException: (403)
> Reason: Forbidden
> HTTP response headers: HTTPHeaderDict({'Audit-Id': 'b00e2cbb-bdb2-41f3-8090-824aee79448c', 'Content-Type': 'application/json', 'Date': 'Mon, 27 Aug 2018 09:51:08 GMT', 'Content-Length': '410'})
> HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"podname-ec366e89ef934d91b2d3ffe96234a725\" is forbidden: exceeded quota: compute-resources, requested: limits.memory=4Gi, used: limits.memory=6508Mi, limited: limits.memory=10Gi","reason":"Forbidden","details":{"name":"podname-ec366e89ef934d91b2d3ffe96234a725","kind":"pods"},"code":403}{code}
>  
> I would expect the scheduler to catch the Exception and at least mark the task as failed, or better yet retry the task later.
>  
>  


