Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/03/30 17:40:14 UTC

[GitHub] [airflow] SamWheating opened a new issue #15097: Errors when launching many pods simultaneously on GKE

SamWheating opened a new issue #15097:
URL: https://github.com/apache/airflow/issues/15097


   **Apache Airflow version**: 2.0.1
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`): 1.18.15-gke.1500
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Google Cloud
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**: 
   - **Others**:
   
   **What happened**:
   
   When many pods are launched at the same time (typically through the `KubernetesPodOperator`), some fail with a 409 Conflict error that the API server returns while it updates a ResourceQuota object.
   
   Full stack trace:
   ```
   Traceback (most recent call last):
     File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
       self._prepare_and_execute_task_with_callbacks(context, task)
     File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
       result = self._execute_task(context, task_copy)
     File "/usr/local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1310, in _execute_task
       result = task_copy.execute(context=context)
     File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 339, in execute
       final_state, _, result = self.create_new_pod_for_operator(labels, launcher)
     File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 485, in create_new_pod_for_operator
       launcher.start_pod(self.pod, startup_timeout=self.startup_timeout_seconds)
     File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 109, in start_pod
       resp = self.run_pod_async(pod)
     File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 87, in run_pod_async
       raise e
     File "/usr/local/lib/python3.8/site-packages/airflow/kubernetes/pod_launcher.py", line 81, in run_pod_async
       resp = self._client.create_namespaced_pod(
     File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
       (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
     File "/usr/local/lib/python3.8/site-packages/kubernetes/client/apis/core_v1_api.py", line 6193, in create_namespaced_pod_with_http_info
       return self.api_client.call_api('/api/v1/namespaces/{namespace}/pods', 'POST',
     File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 330, in call_api
       return self.__call_api(resource_path, method,
     File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 163, in __call_api
       response_data = self.request(method, url,
     File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 371, in request
       return self.rest_client.POST(url,
     File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 260, in POST
       return self.request("POST", url,
     File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 222, in request
       raise ApiException(http_resp=r)
   kubernetes.client.rest.ApiException: (409)
   Reason: Conflict
   HTTP response headers: HTTPHeaderDict({'Audit-Id': '9e2e6081-4e52-41fc-8caa-6db9d546990c', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 30 Mar 2021 15:41:33 GMT', 'Content-Length': '342'})
   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"gke-resource-quotas\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"gke-resource-quotas","kind":"resourcequotas"},"code":409}
   ```
   
   This is a known issue in Kubernetes, as outlined in this issue (in which other users specifically mention Airflow): https://github.com/kubernetes/kubernetes/issues/67761
   
   While this can be handled by task retries, I would like to discuss whether it's worth handling this error within the `KubernetesPodOperator` itself. We could probably check for the error in the pod launcher and automatically retry a few times in this case.
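
A rough sketch of that retry idea follows. It is an illustration under stated assumptions, not the Airflow implementation: `call_with_conflict_retry` and `ConflictError` are hypothetical names, and `ConflictError` merely stands in for `kubernetes.client.rest.ApiException`, which exposes the HTTP code as `.status`. The attempt count and delays are arbitrary choices.

```python
import random
import time


class ConflictError(Exception):
    """Hypothetical stand-in for kubernetes.client.rest.ApiException (has `.status`)."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


def call_with_conflict_retry(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying with jittered exponential backoff on HTTP 409 only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not a 409, and the last failed attempt.
            if getattr(exc, "status", None) != 409 or attempt == max_attempts:
                raise
            # Jitter spreads out retries from pods that were launched together.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))


# Demo: a fake "create pod" call that hits the quota conflict twice, then succeeds.
attempts = {"count": 0}


def fake_create_pod():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConflictError(409)
    return "created"


result = call_with_conflict_retry(fake_create_pod, sleep=lambda _: None)
print(result, "after", attempts["count"], "attempts")
```

In the pod launcher, the wrapped callable would presumably be the `create_namespaced_pod` call that appears in the stack trace, so a transient quota conflict would not consume one of the task's own retries.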
   
   Let me know if you think this is something worth fixing on our end. If so, please assign this issue to me and I can put up a PR in the next week or so. 
   
   If you think that this issue is best handled via task retries or fixed upstream in Kubernetes, feel free to close this.
   
   
   **What you expected to happen**:
   
   I would expect that Airflow could launch many pods at the same time. 
   
   **How to reproduce it**:
   
   Create a DAG that runs 30+ `KubernetesPodOperator` tasks at the same time. A few of them will likely fail.
   
   
   **Anything else we need to know**:
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] wseaton commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
wseaton commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-992660051


   We are actually running into this issue on `1.10.14` when launching 25+ pods at once (on OpenShift instead of GKE). Is there a chance this fix could be backported to the latest 1.10.x release?
   
   ```
   [2021-12-09 22:44:32,525] {taskinstance.py:1150} ERROR - (409)
   Reason: Conflict
   HTTP response headers: HTTPHeaderDict({'Audit-Id': '88892258-bd32-49b8-9be7-2d43461fe952', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '5700ae9b-7d96-4990-86ac-16db6b730352', 'X-Kubernetes-Pf-Prioritylevel-Uid': '4b6183c1-e44a-482d-ac38-8994e7c8ba98', 'Date': 'Thu, 09 Dec 2021 22:44:32 GMT', 'Content-Length': '330'})
   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"my-quota\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"my-quota","kind":"resourcequotas"},"code":409}
   ```





[GitHub] [airflow] ashb commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-811449743


   @SamWheating Assigned, thanks





[GitHub] [airflow] mik-laj commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-810653408


   > We could probably check for the error in the pod launcher and automatically retry a few times in this case.
   
   I can see that the Kubernetes ticket is now 2 years old, so I think we need to patch our side to handle these situations better.





[GitHub] [airflow] SamWheating commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
SamWheating commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-811274036


   👍 - Can you assign this issue to me and I'll open a PR to add some backoff logic to the operator?





[GitHub] [airflow] ashb commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-811450057


   I agree with you that cases like this, where Airflow was never even able to _start_ the task, don't feel like they should "consume" a retry attempt.





[GitHub] [airflow] wseaton commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
wseaton commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-992765823


   Thanks @potiuk, I was unaware that even security patches are no longer being backported. We are in the middle of a migration and this is another reason to pursue it even quicker. Cheers!





[GitHub] [airflow] potiuk commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-992776040


   Good Luck! 





[GitHub] [airflow] potiuk closed issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #15097:
URL: https://github.com/apache/airflow/issues/15097


   





[GitHub] [airflow] potiuk edited a comment on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-992751634


   Airflow 1.10 reached end of life in June 2021 and no longer receives even critical security fixes. https://github.com/apache/airflow#version-life-cycle
   
   Please upgrade to Airflow 2 ASAP.
   
   In case you have not seen the latest Log4J security issue: it does not affect Airflow, but there might be similar future discoveries that do. So if you want to be sure that you will get a fix quickly when a similar problem appears, make sure you are on Airflow 2.





[GitHub] [airflow] potiuk commented on issue #15097: Errors when launching many pods simultaneously on GKE

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #15097:
URL: https://github.com/apache/airflow/issues/15097#issuecomment-992751634


   Airflow 1.10 reached end of life in June 2021 and no longer receives even critical security fixes. https://github.com/apache/airflow#version-life-cycle
   
   Please upgrade to Airflow 2 ASAP.
   
   In case you have not seen the latest Log4J security issue: it does not affect Airflow, but there might be similar future discoveries that do. So if you want to be sure that in case of similar problem you will get a fix fast 

