Posted to issues@flink.apache.org by "xiaogang zhou (Jira)" <ji...@apache.org> on 2024/01/09 08:53:00 UTC

[jira] [Comment Edited] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail

    [ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804638#comment-17804638 ] 

xiaogang zhou edited comment on FLINK-33728 at 1/9/24 8:52 AM:
---------------------------------------------------------------

[~xtsong] In a default Flink setup, when the KubernetesClient disconnects from the Kubernetes API server it will try to reconnect indefinitely, since kubernetes.watch.reconnectLimit defaults to -1. However, the KubernetesClient treats ResourceVersionTooOld as a special case that escapes this normal reconnect loop. That in turn makes Flink's FlinkKubeClient retry creating the watch up to kubernetes.transactional-operation.max-retries times, with no interval between the retries. If the watcher still does not recover, the JM kills itself.
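
For illustration only, a minimal sketch of the behaviour described above, assuming the fabric8 Watcher API (the class name is made up and this is not the actual Flink code): ordinary disconnects are handled by the client's own reconnect loop, while an HTTP 410 (ResourceVersionTooOld) escapes to the caller, which then retries without any delay.

    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.Watcher;
    import io.fabric8.kubernetes.client.WatcherException;

    // Hypothetical watcher mirroring the semantics described above.
    final class SketchPodWatcher implements Watcher<Pod> {
        @Override
        public void eventReceived(Action action, Pod pod) {
            // react to pod ADDED / MODIFIED / DELETED events
        }

        @Override
        public void onClose(WatcherException cause) {
            if (cause != null && cause.isHttpGone()) {
                // HTTP 410 GONE (ResourceVersionTooOld): the client gives up
                // reconnecting, so the caller must decide how to re-create the
                // watch; today that is a tight retry loop bounded by
                // kubernetes.transactional-operation.max-retries.
                return;
            }
            // Other disconnects are retried by the client itself, up to
            // kubernetes.watch.reconnectLimit (-1, i.e. unlimited, by default).
        }
    }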

 

So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs re-creating their watches at the same time, but also to let Flink keep running even when the Kubernetes API server is in a degraded state. Most of the time, Flink TMs do not need to be affected by a bad API server.

 

If you think it is not acceptable to recover the watcher only when resources are requested, another possible approach is to retry the rewatch of the pods periodically, as sketched below.
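
To make that concrete, a rough sketch of the periodic-rewatch idea (all names are hypothetical; nothing in Flink works this way today):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical helper: instead of retrying immediately until the retry
    // budget is exhausted and killing the JM, keep the JM alive and retry the
    // watch creation on a fixed schedule until it succeeds.
    final class PeriodicRewatcher {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        void scheduleRewatch(Runnable createWatch, long intervalSeconds) {
            scheduler.scheduleWithFixedDelay(
                    () -> {
                        try {
                            createWatch.run();    // try to re-create the pod watch
                            scheduler.shutdown(); // stop once the watch is back
                        } catch (RuntimeException e) {
                            // watch creation failed again; retry on the next tick
                        }
                    },
                    0,
                    intervalSeconds,
                    TimeUnit.SECONDS);
        }
    }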

 

WDYT? :) 



> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes etcd was responding slowly. After Kubernetes recovered an hour later, thousands of Flink jobs using the KubernetesResourceManagerDriver re-created their watches upon receiving ResourceVersionTooOld, which put great pressure on the API server and made it fail again... 
>  
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>  
> We could simply ignore the disconnection of the watch and try to rewatch once a new requestResource call arrives, and we can rely on the Akka heartbeat timeout to discover TM failures, just as the YARN mode does.
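
For readers of this thread, a minimal sketch of the "rewatch on requestResource" idea from the description above (all names are hypothetical; this is not the actual KubernetesResourceManagerDriver code):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical driver fragment: instead of failing fatally when the watch
    // closes, remember that it is gone and lazily re-create it the next time a
    // resource is requested. Lost TMs are still detected via the heartbeat
    // timeout, as in YARN mode.
    final class LazyRewatchDriver {
        private final AtomicBoolean watchLost = new AtomicBoolean(false);

        void onWatchClosed() {
            watchLost.set(true); // do not propagate onError, just remember the state
        }

        void requestResource(Runnable createTaskManagerPod, Runnable recreateWatch) {
            if (watchLost.compareAndSet(true, false)) {
                recreateWatch.run(); // rewatch only when resources are actually requested
            }
            createTaskManagerPod.run();
        }
    }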



--
This message was sent by Atlassian Jira
(v8.20.10#820010)