You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/01 13:31:36 UTC

[GitHub] [airflow] SamWheating opened a new pull request #15137: Adding retries when starting kubernetes pods

SamWheating opened a new pull request #15137:
URL: https://github.com/apache/airflow/pull/15137


   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   closes: https://github.com/apache/airflow/issues/15097
   
   Adding retries to handle Kubernetes API Exceptions while trying to start a pod. 
   
   I've opted for a random exponential backoff since the issues we've been seeing have been the result of too many simultaneous requests, so using a retry without jitter could lead to just repeating the same race conditions over and over. 
   
   This will also retry in the event of _any_ kubernetes API Error. Is this alright? If we want to reduce the scope to just 409 errors as described in the linked issue its pretty easy to do so using tenacity's `retry=retry_if_exception(predicate)` functionality. 
   
   I'll try to run some experiments this morning to replicate the original issue and confirm that this fixes things. 
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r605860870



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       What does ApiException cover? Would this also be the same error if you send invalid payload?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
SamWheating commented on pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#issuecomment-812088826


   Since this requires changes to the pod_launcher and not the CNCF provider package, would it be possible to get this into the 2.0.2 release @ashb ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r605972319



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       Ok, I just pushed a commit to only retry on 409 ApiExceptions, let me know what you think. 
   
   Re: the location of this code - it looks like the pod launching code is also used by the KubernetesExecutor, so if you wanted to move the pod_launcher to the CNCF provider package you would likely need a copy within the Airflow package as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #15137:
URL: https://github.com/apache/airflow/pull/15137


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
dimberman commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r606390160



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       @SamWheating I've created an issue here https://github.com/apache/airflow/issues/15164. Please comment on it and I will assign it to you :).
   
   I'd say you should attack this quickly, as without this fix we won't be able to release this fix until the next Airflow release  (and it will require an Airflow upgrade). 
   
   That said, I'm glad to make this PR a high priority, so once it's ready I can be fast with PR reviews to get it through sooner than later.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#issuecomment-812860240


   The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r606326341



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       Yeah, I just looked a little more and it appears that you're right. Its probably not too big of a change then to move a subset of the pod_launcher file into the cncf provider package and update some imports and such, and I'd definitely be interested in helping with that / taking that on. Would you like to write up an issue for that, or shall I?
   
   Regarding the issue in question, would you be OK with reviewing/merging this PR in the meantime? A larger refactor might take a while to get properly reviewed and this issue is causing a lot of failures on our end.   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
dimberman commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r605948923



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       @ashb @SamWheating this might be a good reason to move the pod_launching code into the cncf.kubernetes package. I don't think K8sPodOperators should require dependencies on Airflow for these kinds of fixes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r605972319



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       Ok, I just pushed a commit to only retry on 409 ApiExceptions, let me know what you think. 
   
   Re: the code structure - it looks like the pod launching code is also used by the KubernetesExecutor, so if you wanted to move the pod_launcher to the CNCF provider package you would likely need a copy within the Airflow package as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r605870761



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       I believe so, I think it would also cover 5xx errors due to invalid credentials in which case there definitely isn't any point to retrying. 
   
   With this in mind, should we scope this retry down to only cover 409 errors? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#issuecomment-814421098


   should we merge that one and include it in the provider release :P ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
dimberman commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r606317466



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       I think there are functions that are only used by the k8spodoperator/only used by the k8sexecutor, so wouldn't be TOO bad. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] SamWheating commented on pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
SamWheating commented on pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#issuecomment-814416685


   @dimberman - thanks for getting that refactor in so quickly 🎉
   
   I've rebased on master and moved my fix into the CNCF provider so this should be ready to re-review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #15137: Adding retries when starting kubernetes pods

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #15137:
URL: https://github.com/apache/airflow/pull/15137#discussion_r605912371



##########
File path: airflow/kubernetes/pod_launcher.py
##########
@@ -98,6 +98,12 @@ def delete_pod(self, pod: V1Pod):
             if e.status != 404:
                 raise
 
+    @tenacity.retry(
+        stop=tenacity.stop_after_attempt(3),
+        wait=tenacity.wait_random_exponential(),
+        reraise=True,
+        retry=tenacity.retry_if_exception_type(ApiException),

Review comment:
       I think so, yeah probably.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org