You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/31 21:09:05 UTC

[GitHub] [airflow] jameslamb opened a new pull request #17953: add more information to PodLauncher timeout error

jameslamb opened a new pull request #17953:
URL: https://github.com/apache/airflow/pull/17953


   Thanks very much for all the work that has gone into v2 of `KubernetesPodOperator`!
   
   This PR proposes two small changes (one for documentation, one to a log message in `PodLauncher`), which I think might help others in debugging task failures when using this operator.
   
   ## Description
   
   In my recent experience with `KubernetesPodOperator` (using `airflow` 2.1.0), I've found that for some classes of issues which cause a task to fail, it can be difficult to diagnose them from only the information in the Airflow UI.
   
   For problems where kubernetes is able to create a pod but one or more of its containers fails to start, I've found that the task logs in the Airflow UI look something like this:
   
   ```text
   [WARN] Pod not yet started: some-pod-amhk4t
   [WARN] Pod not yet started: some-pod-amhk4t
   [WARN] Pod not yet started: some-pod-amhk4t
   ...
   ...
   packages/airflow/providers/cncf/kubernetes/utils/pod_launcher.py", line 131, in start_pod
       raise AirflowException("Pod took too long to start")
   airflow.exceptions.AirflowException: Pod took too long to start
   ```
   
   My first thought seeing that log message was "oh ok maybe image pulling is taking a while and I just need to set a higher timeout". My second thought was "ok..'too long' according to what configuration?".
   
   I've found that the Airflow task logs can look like that for any of the following issues:
   
   * referencing a secret that doesn't exist in the target namespace
   * requesting an `image` that your pod isn't authorized to pull (e.g., it's in a private repository and you failed to specify `imagePullSecrets`)
   * referencing a volume that does not exist
   
   In these cases, diagnosing the issue requires going directly to kubernetes to get the pod events.
   
   I hope that the changes in this PR might save others some debugging time in the future.
   
   Thanks very much for your time and consideration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#issuecomment-939245058


   The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r727342467



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       > Logging events is turned off by default
   
   Ah, well that would do it πŸ™ƒ




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#issuecomment-909640224


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst) Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better πŸš€.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#issuecomment-941582292


   Thanks @jameslamb, congrats on your second commit πŸŽ‰


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jameslamb commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jameslamb commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r727316704



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       > I'd rather have no examples in every single exception and just point people to the events
   
   Ok sure! I'll update to that
   
   > Another factor here is that the events should also be logged in the task log (in theory). Did this not happen for you?
   
   Logging events is turned off by default in `KubernetesPodOperator`.
   
   https://github.com/apache/airflow/blob/b8d06e812ac56af6b0d17830c63b705ace9d4959/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L214




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham merged pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham merged pull request #17953:
URL: https://github.com/apache/airflow/pull/17953


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham merged pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham merged pull request #17953:
URL: https://github.com/apache/airflow/pull/17953


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r727274888



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       I see where you are going but as an example, just last week I saw a case where no nodes had enough free resources. There are a ton of reasons why the timeout could be missed, I'm not sure singling out structural errors makes sense to me. I'd rather have no examples in every single exception and just point people to the events. I hear you on the double timeouts reference though. Maybe this instead? Short and sweet:
   
   ```
   f"Pod took longer than {startup_timeout} seconds to start (startup_timeout)."
   " Check the pod events in kubernetes to determine why."
   ```
   
   Another factor here is that the events should also be logged in the task log (in theory). Did this not happen for you? Feels a little weird to tell folks to "go look at k8s pod events" when it should be logged alongside the same message πŸ€·β€β™‚οΈ

##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       > Logging events is turned off by default
   
   Ah, well that would do it πŸ™ƒ




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r727274888



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       I see where you are going but as an example, just last week I saw a case where no nodes had enough free resources. There are a ton of reasons why the timeout could be missed, I'm not sure singling out structural errors makes sense to me. I'd rather have no examples in every single exception and just point people to the events. I hear you on the double timeouts reference though. Maybe this instead? Short and sweet:
   
   ```
   f"Pod took longer than {startup_timeout} seconds to start (startup_timeout)."
   " Check the pod events in kubernetes to determine why."
   ```
   
   Another factor here is that the events should also be logged in the task log (in theory). Did this not happen for you? Feels a little weird to tell folks to "go look at k8s pod events" when it should be logged alongside the same message πŸ€·β€β™‚οΈ




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jameslamb commented on pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jameslamb commented on pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#issuecomment-939227912


   @ephraimbuddy ok, I think this is ready for review!
   
   Sorry for the delay in responding to your request. I'm new to contributing to Airflow, so it took me a little while to figure out the test setup and how to mock my way into the point where the exception modified in this PR is raised.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#issuecomment-941582292


   Thanks @jameslamb, congrats on your second commit πŸŽ‰


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jameslamb commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jameslamb commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r727330454



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       Simplified the message in https://github.com/apache/airflow/pull/17953/commits/b7e54b461f9bdd237ee4ed77fbe4d196bb53037b




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jameslamb commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jameslamb commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r727316704



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       > I'd rather have no examples in every single exception and just point people to the events
   
   Ok sure! I'll update to that
   
   > Another factor here is that the events should also be logged in the task log (in theory). Did this not happen for you?
   
   Logging events is turned off by default in `KubernetesPodOperator`.
   
   https://github.com/apache/airflow/blob/b8d06e812ac56af6b0d17830c63b705ace9d4959/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L214

##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       Simplified the message in https://github.com/apache/airflow/pull/17953/commits/b7e54b461f9bdd237ee4ed77fbe4d196bb53037b




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jameslamb commented on pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jameslamb commented on pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#issuecomment-913004948


   > We should also add a test to ensure this exception message is not altered in the future accidentally
   
   Ok sure, I can try to do that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jameslamb commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jameslamb commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r702305867



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       I think this addition might be confusing. The behavior I've mentioned in the PR description can happen if you reference a volume that does not exist, but I think some people might interpret "non-existing Volume" to mean "you forgot to add a necessary volume to the pod spec".
   
   I don't think it's necessary for this message to be an exhaustive list, so I think just having one example  and the advice to go look at pod events is enough.
   
   That said, I'm not a maintainer here so if you still prefer this addition I'd be happy to accept it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ephraimbuddy commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
ephraimbuddy commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r700798716



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       ```suggestion
                           "for structural errors like a missing imagePullSecret, non-existing Volume etc"
   ```
   What do you think?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#issuecomment-909640224


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst) Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better πŸš€.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jameslamb commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jameslamb commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r725575828



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       Sure! I've proposed some new language in https://github.com/apache/airflow/pull/17953/commits/a78f858fc73f426d1205025583e694ef898adc1d.
   
   I think another reference to timeouts (like "didn't start quickly enough") should be avoided. My goal with this pull request is to make it clearer that while this error is triggered by a timeout, it can be observed in situations where configuration or permission issues mean that the pod will NEVER start correctly.
   
   I opened this based on my personal experience using `KubernetesPodOperator`, where I didn't realize I had a typo in the name of an `imagePullSecret`, observed a task failing with this error about timeouts, and spent some time testing higher values of `startup_timeout` trying to resolve the issue.
   
   In my experience so far with `KubernetesPodOperator`, such k8s configuration errors are common and I hope that this error message will save others some time by making it clear that they should check k8s errors before just trying to increase the timeout.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jedcunningham commented on a change in pull request #17953: add more information to PodLauncher timeout error

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on a change in pull request #17953:
URL: https://github.com/apache/airflow/pull/17953#discussion_r725519410



##########
File path: airflow/providers/cncf/kubernetes/utils/pod_launcher.py
##########
@@ -128,7 +128,12 @@ def start_pod(self, pod: V1Pod, startup_timeout: int = 120):
                 self.log.warning("Pod not yet started: %s", pod.metadata.name)
                 delta = dt.now() - curr_time
                 if delta.total_seconds() >= startup_timeout:
-                    raise AirflowException("Pod took too long to start")
+                    msg = (
+                        f"Pod took longer than {startup_timeout} seconds to start. "
+                        "Increasing 'startup_timeout' might resolve this error, but check the pod events in kubernetes "
+                        "for structural errors like a missing imagePullSecret."

Review comment:
       Maybe we should be even more generic, something like this:
   
   ```suggestion
                           "to determine why the pod didn't start quickly enough."
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org