Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/25 08:50:30 UTC

[GitHub] [airflow] cansjt opened a new issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

cansjt opened a new issue #21087:
URL: https://github.com/apache/airflow/issues/21087


   ### Apache Airflow version
   
   2.2.3 (latest released)
   
   ### What happened
   
   After upgrading Airflow to 2.2.3 (from 2.2.2) and the cncf.kubernetes provider to 3.0.1 (from 2.0.3), we started to see these errors in the logs:
   ```
   {"asctime": "2022-01-25 08:19:39", "levelname": "ERROR", "process": 565811, "name": "airflow.executors.kubernetes_executor.KubernetesJobWatcher", "funcName": "run", "lineno": 111, "message": "Unknown error in KubernetesJobWatcher. Failing", "exc_info": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 102, in run\n    self.resource_version = self._run(\n  File \"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py\", line 145, in _run\n    for event in list_worker_pods():\n  File \"/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py\", line 182, in stream\n    raise client.rest.ApiException(\nkubernetes.client.exceptions.ApiException: (410)\nReason: Expired: too old resource version: 655595751 (655818065)\n"}
   Process KubernetesJobWatcher-6571:
   Traceback (most recent call last):
     File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
       self.run()
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
       self.resource_version = self._run(
     File "/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
       for event in list_worker_pods():
     File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 182, in stream
       raise client.rest.ApiException(
   kubernetes.client.exceptions.ApiException: (410)
   Reason: Expired: too old resource version: 655595751 (655818065)
   ```
   Pods are created and run to completion, but it seems the KubernetesJobWatcher is incapable of seeing that they completed. From there, Airflow grinds to a complete halt.
   
   ### What you expected to happen
   
   No errors in the logs, and the job watcher does its job of collecting completed jobs.
   
   ### How to reproduce
   
   I wish I knew. I am trying to downgrade the cncf.kubernetes provider to previous versions to see if it helps.
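
   For what it's worth, it seems the same 410 can be provoked outside Airflow by watching from a deliberately stale resource version. A rough, untested sketch (the namespace is a placeholder; this is not the executor's actual code):
   ```
   # Untested sketch: watch pods from a resource version that has long expired,
   # which should make the kubernetes client raise the same ApiException (410).
   from kubernetes import client, config, watch
   from kubernetes.client.exceptions import ApiException

   config.load_kube_config()  # or config.load_incluster_config() inside the cluster
   v1 = client.CoreV1Api()

   try:
       for event in watch.Watch().stream(
           v1.list_namespaced_pod,
           namespace="default",      # placeholder namespace
           resource_version="1",     # almost certainly expired by now
           timeout_seconds=30,
       ):
           print(event["type"], event["object"].metadata.name)
   except ApiException as exc:
       print("watch failed:", exc.status, exc.reason)  # expect 410 / Expired
   ```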
   
   ### Operating System
   
   k8s (Airflow images are Debian based)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon 2.6.0
   apache-airflow-providers-cncf-kubernetes 3.0.1
   apache-airflow-providers-ftp 2.0.1
   apache-airflow-providers-http 2.0.2
   apache-airflow-providers-imap 2.1.0
   apache-airflow-providers-postgres 2.4.0
   apache-airflow-providers-sqlite 2.0.1
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   The deployment is on k8s v1.19.16, made with Helm 3.
   
   ### Anything else
   
   In its symptoms, this looks a lot like #17629, but it happens in a different place.
   Redeploying as suggested in that issue seemed to help, but most of the jobs that were supposed to run last night got stuck again. All jobs use the same pod template, without any customization.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] cansjt commented on issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

Posted by GitBox <gi...@apache.org>.
cansjt commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1054001220


   To me, per the tickets referenced in #15500, the problem seems to lie more in the Kubernetes Python client. It does not handle the kind of messages we'd need for Airflow to recover once it has lost track of resource versions.
   
   Last I checked, there was no bug report there regarding that, so I did [open one](https://github.com/kubernetes-client/python-base/issues/286). I am sadly not the most knowledgeable person to provide insight on how best to resolve this. @jedcunningham Maybe you can help?
   
   Until this is resolved on the k8s client side, what is the plan for Airflow? Wait? Resurrect the #15500 PR? (not sure if it fully solves the issue)





[GitHub] [airflow] potiuk commented on issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1035396420


   > More context here: [#15500 (comment)](https://github.com/apache/airflow/pull/15500#issuecomment-827885109)
   > 
   > Bottom line, especially now that we aren't pinned to `kubernetes==11`, we can probably handle this more gracefully now on our side.
   
   Thanks for the context - now I understand where it comes from! Yeah it isn't an easy one to handle!
   





[GitHub] [airflow] cansjt commented on issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

Posted by GitBox <gi...@apache.org>.
cansjt commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1024424323


   After downgrading the `cncf.kubernetes` provider to 2.1.0, the logs showed another error (but reported at log level INFO!):
   ```
   {"asctime": "2022-01-25 17:24:18",
    "levelname": "INFO",
    "process": 1,
    "name": "airflow.executors.kubernetes_executor.KubernetesExecutor",
    "funcName": "_adopt_completed_pods",
    "lineno": 740,
    "message": "Failed to adopt pod <removed>. Reason: (403)
   Reason: Forbidden
   HTTP response headers: HTTPHeaderDict({'Audit-Id': '<removed>', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 25 Jan 2022 17:24:18 GMT', 'Content-Length': '414'})
   HTTP response body: {
     \"kind\":\"Status\",
     \"apiVersion\":\"v1\",
     \"metadata\":{},
     \"status\":\"Failure\",
     \"message\":\"pods \\\"<removed>\\\" is forbidden: User \\\"system:serviceaccount:<removed>\\\" cannot patch resource \\\"pods\\\" in API group \\\"\\\" in the namespace \\\"<removed>\\\"\",
     \"reason\":\"Forbidden\",
     \"details\":{\"name\":\"<removed>\",
                      \"kind\":\"pods\"},
     \"code\":403}
   
   "}
   ```
   Granting the missing permission to the scheduler's service account fixed the issue. I still need to upgrade the `cncf.kubernetes` provider again to confirm whether or not this was the root cause of the reported error.
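
   For anyone hitting the same 403: one quick way to check whether the identity the scheduler runs as is allowed to patch pods is a SelfSubjectAccessReview. A minimal sketch (the namespace is a placeholder; run it with the scheduler's service account credentials):
   ```
   # Minimal sketch: ask the API server whether the current identity (e.g. the
   # scheduler's service account) may patch pods in the given namespace.
   from kubernetes import client, config

   config.load_incluster_config()  # run from a pod using the scheduler's service account

   review = client.V1SelfSubjectAccessReview(
       spec=client.V1SelfSubjectAccessReviewSpec(
           resource_attributes=client.V1ResourceAttributes(
               namespace="airflow",  # placeholder namespace
               verb="patch",
               resource="pods",
           )
       )
   )
   result = client.AuthorizationV1Api().create_self_subject_access_review(review)
   print("patch pods allowed:", result.status.allowed)
   ```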





[GitHub] [airflow] jedcunningham commented on issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

Posted by GitBox <gi...@apache.org>.
jedcunningham commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1035327244


   It's not that there are too many changes, at least not in the way you are thinking. The executor is trying to watch from (history) revision n, which has rolled off the history on the k8s side; n+2 might be the oldest version still available.
   
   This really isn't related to our deployments, and there isn't anything we can do in that regard to help here. In fact, you could hit this even when running the scheduler outside k8s.
   
   More context here: https://github.com/apache/airflow/pull/15500#issuecomment-827885109
   
   Bottom line, especially now that we aren't pinned to `kubernetes==11`, we can probably handle this more gracefully now on our side.
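
   A rough sketch of what "more gracefully" could look like (not the actual executor code; the namespace is a placeholder): on a 410, re-list to get a fresh resourceVersion and resume the watch from there instead of letting the watcher die.
   ```
   # Sketch of the relist-and-resume pattern: when the watch's resourceVersion
   # has rolled off history (HTTP 410), re-list to obtain a current
   # resourceVersion and restart the watch from it.
   from kubernetes import client, config, watch
   from kubernetes.client.exceptions import ApiException

   config.load_kube_config()
   v1 = client.CoreV1Api()
   namespace = "default"  # placeholder

   resource_version = None
   while True:
       try:
           if resource_version is None:
               pods = v1.list_namespaced_pod(namespace)
               resource_version = pods.metadata.resource_version
           for event in watch.Watch().stream(
               v1.list_namespaced_pod,
               namespace=namespace,
               resource_version=resource_version,
               timeout_seconds=60,
           ):
               resource_version = event["object"].metadata.resource_version
               # ... process the event (pod phase changes, etc.) here ...
       except ApiException as exc:
           if exc.status == 410:
               resource_version = None  # expired: relist and resume
               continue
           raise
   ```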





[GitHub] [airflow] potiuk commented on issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1030832535


   I think this also has a similar root cause to #12644. @dimberman @jedcunningham @kaxil - or maybe someone else who has more experience with K8S deployments in "real life" - this "Resource too old" error is returned by K8S when there are too many changes to a version of a K8S resource.
   
   But I am just wondering - does it really happen because we deploy some changes "incrementally" too frequently (and too many times) in the chart/deployment? Or maybe because we do NOT do a "full" deployment where we should?
   
   I am not too experienced with long-running K8S deployments, but to me it looks like something that could be solved by identifying which resources those are and implementing a full "re-deployment" from time to time.
   
   It might be that this is outside of our control as well, but I've seen some other people complaining about this recently, so maybe we could have someone with more insight take a look?





[GitHub] [airflow] arkadiusz-bach commented on issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

Posted by GitBox <gi...@apache.org>.
arkadiusz-bach commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1053759135


   I have the same issue.
   
   Looks like it is happening because the 410 error is now (I don't know since which version) handled on the kubernetes library side: there is one retry, and then an exception is raised if the event is of type 'ERROR'.
   
   I checked the kubernetes library and it was changed in this pull request:
   https://github.com/kubernetes-client/python-base/pull/133/files
   
   On the Airflow Kubernetes Executor side it is handled here:
   https://github.com/apache/airflow/blob/d7265791187fb2117dfd090cdb7cce3f8c20866c/airflow/executors/kubernetes_executor.py#L148
   
   by the process_error function, but it should probably now be wrapped in a try/except for ApiException with a check for the 410 code.
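
   Something along these lines, as a rough standalone sketch of that suggestion (hypothetical helper name, not the actual Airflow code): wrap the event loop in a try/except for ApiException and, on a 410, hand back resource version "0" so the watcher restarts instead of dying.
   ```
   # Rough sketch of the suggested shape of the fix (not Airflow's code): the
   # event loop is wrapped so an ApiException with status 410 resets the watch.
   from kubernetes import client, config, watch
   from kubernetes.client.exceptions import ApiException

   config.load_kube_config()
   v1 = client.CoreV1Api()

   def watch_pods_once(namespace, resource_version):
       """One watcher pass; returns the resource version to resume from."""
       try:
           for event in watch.Watch().stream(
               v1.list_namespaced_pod,
               namespace=namespace,
               resource_version=resource_version,
               timeout_seconds=60,
           ):
               # ... per-event handling (process_status in the executor) ...
               resource_version = event["object"].metadata.resource_version
       except ApiException as exc:
           if exc.status == 410:
               return "0"  # too old: start over instead of crashing the watcher
           raise
       return resource_version

   print(watch_pods_once("default", "0"))  # placeholder namespace
   ```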




