Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/05/16 12:39:00 UTC

[jira] [Updated] (FLINK-15836) Throw fatal error in KubernetesResourceManager when the pods watcher is closed with exception

     [ https://issues.apache.org/jira/browse/FLINK-15836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-15836:
----------------------------------
    Affects Version/s: 1.10.0

> Throw fatal error in KubernetesResourceManager when the pods watcher is closed with exception
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15836
>                 URL: https://issues.apache.org/jira/browse/FLINK-15836
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0
>            Reporter: Yang Wang
>            Assignee: Yang Wang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.10.2
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in the PR [1], if the {{watchReconnectLimit}} is configured by users via Java properties or environment variables, the watch may be stopped once that limit is exhausted, and pod changes will no longer be processed properly. So we need to throw a fatal exception in {{KubernetesResourceManager}} when the watcher is closed with an exception.
>  
> > Why do we not create a new watcher in {{KubernetesResourceManager}} when the old one is closed exceptionally?
> After checking the {{WatchConnectionManager}} implementation in the fabric8 kubernetes client: if the web socket is closed exceptionally, it checks the {{reconnectLimit}} and schedules a reconnect if needed. When the reconnect succeeds, {{currentReconnectAttempt}} is reset to 0. By default, the client retries forever. When users explicitly specify the reconnectLimit, we should respect it.
> Another reason is that an exceptional web socket close is usually caused by network problems or port abuse. In such a situation, it is better to fail the jobmanager pod and retry in a new one.
>  
> [1]. [https://github.com/apache/flink/pull/10965#discussion_r373491974]
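The escalation described above can be sketched as follows. This is a simplified, hypothetical illustration, not Flink's actual implementation: the `FatalErrorHandler` and `PodCallbackHandler` names are stand-ins that only mirror the shape of fabric8's `Watcher` callback, so the example stays self-contained without the fabric8 client on the classpath.

```java
import java.util.concurrent.atomic.AtomicReference;

public class PodWatcherSketch {

    /** Hypothetical stand-in for a fatal-error escalation hook. */
    interface FatalErrorHandler {
        void onFatalError(Throwable t);
    }

    /** Hypothetical stand-in mirroring the shape of fabric8's Watcher callback. */
    static class PodCallbackHandler {
        private final FatalErrorHandler fatalErrorHandler;

        PodCallbackHandler(FatalErrorHandler fatalErrorHandler) {
            this.fatalErrorHandler = fatalErrorHandler;
        }

        /** The client passes a non-null exception when the watch closed abnormally. */
        void onClose(Exception cause) {
            if (cause != null) {
                // The watch is gone for good (e.g. reconnect attempts exhausted);
                // pod changes would be silently lost from here on, so fail fast
                // and let a fresh jobmanager pod take over.
                fatalErrorHandler.onFatalError(cause);
            }
            // A null cause means a graceful close (e.g. normal shutdown): nothing to do.
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> escalated = new AtomicReference<>();
        PodCallbackHandler handler = new PodCallbackHandler(escalated::set);

        handler.onClose(null);                        // graceful close: no escalation
        handler.onClose(new RuntimeException("ws"));  // exceptional close: escalate

        System.out.println("escalated=" + (escalated.get() != null)); // prints "escalated=true"
    }
}
```

The key design point matches the reasoning in the issue: rather than re-creating a watcher (which would override a user-configured reconnect limit), an exceptional close is treated as fatal so that Kubernetes restarts the jobmanager pod.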



--
This message was sent by Atlassian Jira
(v8.3.4#803005)