Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/04/05 06:09:35 UTC

[GitHub] [airflow] cansjt commented on issue #21087: KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state

cansjt commented on issue #21087:
URL: https://github.com/apache/airflow/issues/21087#issuecomment-1088304998

   @gkarg Read carefully. Where did I say there was a blocker? I asked how the Airflow team was planning to deal with this. I basically see two options:
   - Accept a workaround, like the one you suggest (thanks for that);
   - Or wait for the Kubernetes client library to implement proper support for bookmark events.
   
   If the Kubernetes client library were handling bookmark events properly, the `_run()` method could simply yield the updated resource version instead of raising an exception.
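   To illustrate the idea, here is a minimal sketch of how a watch loop could consume bookmark events to keep its resource version fresh rather than failing. This is not Airflow's or the client library's actual code; `process_events` is a hypothetical stand-in for iterating over `kubernetes.watch.Watch().stream(...)`, and the events are simulated dicts so no cluster is needed:

   ```python
   # Hypothetical sketch: consume watch events and use BOOKMARK events
   # to keep the resourceVersion current, instead of raising on them.

   def process_events(stream, resource_version):
       """Consume watch events; return the last known resourceVersion."""
       for event in stream:
           if event["type"] == "BOOKMARK":
               # A bookmark carries no object change, only the latest
               # resourceVersion; remember it so a later reconnect can
               # resume from this point without extra API calls.
               resource_version = event["object"]["metadata"]["resourceVersion"]
               continue
           # ... handle ADDED / MODIFIED / DELETED events here ...
           resource_version = event["object"]["metadata"]["resourceVersion"]
       return resource_version

   # Simulated event stream (no cluster required):
   events = [
       {"type": "ADDED", "object": {"metadata": {"resourceVersion": "100"}}},
       {"type": "BOOKMARK", "object": {"metadata": {"resourceVersion": "150"}}},
   ]
   print(process_events(iter(events), "99"))  # -> "150"
   ```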
   
   What you are suggesting might be a good workaround for Airflow. I am not sure setting the resource version to 0 is okay, though. Is it equivalent to telling Airflow that we do not know the resource version, so that Airflow will then somehow retrieve the right value for it? I don't know Kubernetes' API well enough, I am sorry.
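   As I understand it (and this is an assumption, not a claim about Airflow's actual code), resource version `"0"` asks the API server to start the watch from any convenient recent point rather than from an exact version it may already have compacted away. A minimal sketch of the suggested workaround, with `ApiGone` standing in for a `kubernetes.client.exceptions.ApiException` with `status=410` and `fake_watch` standing in for a real watch call:

   ```python
   # Hypothetical sketch of the workaround: on HTTP 410 ("Gone"),
   # fall back to resource_version "0" and restart the watch once.

   class ApiGone(Exception):
       """Stand-in for ApiException(status=410) from the k8s client."""

   def watch_with_reset(start_watch, resource_version):
       """Run start_watch(rv); on a 410, retry once with rv="0"."""
       try:
           return start_watch(resource_version)
       except ApiGone:
           # The stored version was compacted away; "0" lets the
           # server pick a recent starting point instead of leaving
           # the watcher permanently broken.
           return start_watch("0")

   # Simulated watch call: rejects stale versions, accepts "0".
   def fake_watch(rv):
       if rv != "0":
           raise ApiGone()
       return f"watching from {rv}"

   print(watch_with_reset(fake_watch, "123"))  # -> "watching from 0"
   ```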
   
   Nonetheless, you cannot call this an Airflow issue. The root cause is that the Python Kubernetes client library treats bookmark events not as valid events but as errors. I do not think that is the proper way to handle them, and neither do the kubernetes lib authors: [this comment](https://github.com/kubernetes-client/python-base/pull/234/files#diff-9b5753ebf2c77814b2fb9a12781c7f5ca6a48f41e2c1a356f24f53adc24a2f24R99) couldn't be any clearer. They did that to bypass a decoding error in the event payload when bookmark events were added to Kubernetes, but never got back to actually implementing them. In particular, a bookmark event should provide you with the actual revision of the object; because bookmarks are not implemented, you can't access that information. Which means, if my understanding is correct, you'll probably have to make additional API calls that you should not need to make; avoiding them is the whole point of those events.
   
   Finding a workaround is not the same as fixing the root cause of a problem. The fact that we might be able to work around it in Airflow does not mean the feature should not be implemented by the Kubernetes client library for Python.
   
   Also consider this: couldn't there be other reasons for the Kubernetes API to answer with an HTTP 410 error? Maybe the `_run()` method makes only a single Kubernetes API call, making the answer unambiguous; maybe it doesn't (I don't know, nor did I look for the answer).
   
   And you said it yourself:
   > this is probably too much of a hack
   
   Maybe... Though, honestly, having to periodically tear down and redeploy Airflow is a bit of a hassle, and time I'd gladly spend on something else. So I'd be glad if the Airflow team would seriously consider your workaround, and thanks again for looking into this issue.

