Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2022/01/31 10:24:00 UTC

[jira] [Commented] (FLINK-25893) ResourceManagerServiceImpl's lifecycle can lead to exceptions

    [ https://issues.apache.org/jira/browse/FLINK-25893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484597#comment-17484597 ] 

Till Rohrmann commented on FLINK-25893:
---------------------------------------

[~xtsong] could you take a look at this problem? I think you know this part of the code best.

> ResourceManagerServiceImpl's lifecycle can lead to exceptions
> -------------------------------------------------------------
>
>                 Key: FLINK-25893
>                 URL: https://issues.apache.org/jira/browse/FLINK-25893
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Priority: Critical
>
> The {{ResourceManagerServiceImpl}} lifecycle can lead to exceptions when calling {{ResourceManagerServiceImpl.deregisterApplication}}. The problem arises when the {{DispatcherResourceManagerComponent}} is shut down before the {{ResourceManagerServiceImpl}} gains leadership or while it is starting the {{ResourceManager}}.
> One problem is that {{deregisterApplication}} returns an exceptionally completed future if there is no leading {{ResourceManager}}.
> Another problem is that even if there is a leading {{ResourceManager}}, it may not have been started yet. In that case, the [ResourceManagerGateway.deregisterApplication|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImpl.java#L143] call will be discarded. The reason for this behaviour is that we create the {{ResourceManager}} in one {{Runnable}} and only start it in another, so a {{deregisterApplication}} call can acquire the {{lock}} in between the two.
> I'd suggest correcting the lifecycle and contract of {{ResourceManagerServiceImpl.deregisterApplication}}.
> Please note that due to this problem, the error reporting of this method has been suppressed. See FLINK-25885 for more details.
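
To make the two failure modes in the description concrete, here is a minimal Java sketch. It is a simplified model and not the Flink implementation: {{ResourceManagerServiceSketch}}, {{FakeResourceManager}}, {{createResourceManager}} and {{startResourceManager}} are hypothetical names that stand in for the service, the {{ResourceManager}} and the two {{Runnable}}s, and the boolean result is an assumption used only to show whether a call was processed or discarded.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical, simplified model of the lifecycle in the ticket; not Flink code. */
class ResourceManagerServiceSketch {

    /** Stand-in for a ResourceManager that ignores calls until it has been started. */
    static final class FakeResourceManager {
        private boolean started;

        void start() {
            started = true;
        }

        /** Returns true if the call was processed, false if it was discarded. */
        boolean deregisterApplication() {
            return started;
        }
    }

    private final Object lock = new Object();

    // null models "no leading ResourceManager".
    private FakeResourceManager leaderResourceManager;

    /** Models the first Runnable: the ResourceManager is created but not started. */
    void createResourceManager() {
        synchronized (lock) {
            leaderResourceManager = new FakeResourceManager();
        }
    }

    /** Models the second Runnable; must only run after createResourceManager(). */
    void startResourceManager() {
        synchronized (lock) {
            leaderResourceManager.start();
        }
    }

    /** Fails without a leader; completes with false if the call was discarded. */
    CompletableFuture<Boolean> deregisterApplication() {
        synchronized (lock) {
            CompletableFuture<Boolean> result = new CompletableFuture<>();
            if (leaderResourceManager == null) {
                result.completeExceptionally(
                        new IllegalStateException("No leading ResourceManager."));
            } else {
                result.complete(leaderResourceManager.deregisterApplication());
            }
            return result;
        }
    }

    public static void main(String[] args) {
        ResourceManagerServiceSketch service = new ResourceManagerServiceSketch();

        // Case 1: shutdown before leadership was gained.
        System.out.println("failed without leader: "
                + service.deregisterApplication().isCompletedExceptionally());

        // Case 2: the call gets the lock between creation and start.
        service.createResourceManager();
        System.out.println("processed before start: "
                + service.deregisterApplication().join());

        // Case 3: the call arrives after the ResourceManager was started.
        service.startResourceManager();
        System.out.println("processed after start: "
                + service.deregisterApplication().join());
    }
}
```

In this model the first call completes exceptionally and the second is silently dropped. Only the third, issued after the start step, is processed. A corrected lifecycle would need to define what callers should observe in the first two cases.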



--
This message was sent by Atlassian Jira
(v8.20.1#820001)