You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "MaksYermak (via GitHub)" <gi...@apache.org> on 2024/04/23 10:42:09 UTC

[PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

MaksYermak opened a new pull request, #39201:
URL: https://github.com/apache/airflow/pull/39201

   <!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at
   
      http://www.apache.org/licenses/LICENSE-2.0
   
    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
    -->
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   In this PR I have added retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator.
   
   This logic is needed for preventing 'No agent available' error. The error appears time to time when users try to create a Resource or Job. This issue is inside Kubernetes and in the current moment has no solution. Like a temporary solution we decided to retry Job or Resource creation request each time when this error appears.
   
   Link for the same issue for cert-manager service: https://github.com/cert-manager/cert-manager/issues/6457
   
   <!-- Please keep an empty line above the dashes. -->
   ---
   **^ Add meaningful description above**
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "raphaelauv (via GitHub)" <gi...@apache.org>.
raphaelauv commented on PR #39201:
URL: https://github.com/apache/airflow/pull/39201#issuecomment-2072044512

   why don't you use the internal retry parameter of airflow ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "raphaelauv (via GitHub)" <gi...@apache.org>.
raphaelauv commented on PR #39201:
URL: https://github.com/apache/airflow/pull/39201#issuecomment-2072229080

   I was thinking about the BasOperator argument retries
   
   ```python
       PythonOperator(
           task_id="aa",
           retries=3,
           python_callable=toto,
       )
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "dirrao (via GitHub)" <gi...@apache.org>.
dirrao commented on code in PR #39201:
URL: https://github.com/apache/airflow/pull/39201#discussion_r1578826472


##########
airflow/providers/cncf/kubernetes/hooks/kubernetes.py:
##########
@@ -486,6 +488,12 @@ def get_deployment_status(
         except Exception as exc:
             raise exc
 
+    @tenacity.retry(

Review Comment:
   This will be retried for non transient errors as well.  can't we rely on the task retry instead of explicit retry here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk merged PR #39201:
URL: https://github.com/apache/airflow/pull/39201


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "raphaelauv (via GitHub)" <gi...@apache.org>.
raphaelauv commented on PR #39201:
URL: https://github.com/apache/airflow/pull/39201#issuecomment-2072235035

   related : https://github.com/apache/airflow/pull/15137


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "vincbeck (via GitHub)" <gi...@apache.org>.
vincbeck commented on PR #39201:
URL: https://github.com/apache/airflow/pull/39201#issuecomment-2072414532

   > I was thinking about the BasOperator argument `retries`
   > 
   > ```python
   >     PythonOperator(
   >         task_id="aa",
   >         retries=3,
   >         python_callable=toto,
   >     )
   > ```
   
   This is an option for users. If a user wants to retry a specific task, they can use this parameter. Here, if I understand correctly, @MaksYermak wants to retry without the user being aware or needing to do something.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #39201:
URL: https://github.com/apache/airflow/pull/39201#discussion_r1596646854


##########
airflow/providers/cncf/kubernetes/hooks/kubernetes.py:
##########
@@ -486,6 +488,12 @@ def get_deployment_status(
         except Exception as exc:
             raise exc
 
+    @tenacity.retry(

Review Comment:
   Yeah - it's more "resilience" than actual retry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "MaksYermak (via GitHub)" <gi...@apache.org>.
MaksYermak commented on PR #39201:
URL: https://github.com/apache/airflow/pull/39201#issuecomment-2072164596

   > why don't you use the internal retry parameter of airflow ?
   
   I use the same approach what we use for retry Pod creation: https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L347C1-L356C1
   Also, I didn't see any other approaches in the `cncf` package for retry functionality in this case I decided to use the same. Could you please share with me the example of Airflow's code for this internal retry parameter logic?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "MaksYermak (via GitHub)" <gi...@apache.org>.
MaksYermak commented on PR #39201:
URL: https://github.com/apache/airflow/pull/39201#issuecomment-2098044673

   > Could you add tests to cover these retries?
   
   Sure, I have added a unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "MaksYermak (via GitHub)" <gi...@apache.org>.
MaksYermak commented on code in PR #39201:
URL: https://github.com/apache/airflow/pull/39201#discussion_r1592404990


##########
airflow/providers/cncf/kubernetes/hooks/kubernetes.py:
##########
@@ -486,6 +488,12 @@ def get_deployment_status(
         except Exception as exc:
             raise exc
 
+    @tenacity.retry(

Review Comment:
   The idea was to retry HTTP request to Kubernetes when Kubernetes returns `No agent available` error. This issue is inside Kubernetes and appears time to time. I know that this error has 500 code which, also, code for non-transient errors. I do not see any problems for non-transient errors because the code will try to retry request only 3 times. If users have non-transient errors, after the third attempt they will see an exception.
   
   As for relying on task retry. I think it's not a good idea to retry the whole task because of a temporary problem with HTTP request.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator [airflow]

Posted by "VladaZakharova (via GitHub)" <gi...@apache.org>.
VladaZakharova commented on PR #39201:
URL: https://github.com/apache/airflow/pull/39201#issuecomment-2104254264

   Hi @raphaelauv @dirrao @vincbeck !
   Can you please check the changes here again? Thanks! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org