You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "yanpengshi (Jira)" <ji...@apache.org> on 2022/05/15 03:02:00 UTC

[jira] [Commented] (FLINK-27354) JobMaster still processes requests while terminating

    [ https://issues.apache.org/jira/browse/FLINK-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537125#comment-17537125 ] 

yanpengshi commented on FLINK-27354:
------------------------------------

When the JobMaster receives  disconnectResourceManager RPC call  from the resourceManager, it will try to call 

reconnectToResourceManage and cause the problem.

 

If we set the  resourceManagerAddress as null when JobMaster shuts down, then JobMaster::

isConnectingToResourceManager will return false and JobMaster will not try to reconncet to resourceManager.

. What do you think? @[~mapohl] 

 

> JobMaster still processes requests while terminating
> ----------------------------------------------------
>
>                 Key: FLINK-27354
>                 URL: https://issues.apache.org/jira/browse/FLINK-27354
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.13.6, 1.14.4
>            Reporter: Matthias Pohl
>            Priority: Major
>         Attachments: flink-logs.zip
>
>
> An issue was reported in the [user ML|https://lists.apache.org/thread/5pm3crntmb1hl17h4txnlhjz34clghrg] about the JobMaster trying to reconnect to the ResourceManager during shutdown.
> The JobMaster is disconnecting from the ResourceManager during shutdown (see [JobMaster:1182|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1182]). This triggers the deregistration of the job in the {{ResourceManager}}. The RM responses asynchronously at the end of this deregistration through {{disconnectResourceManager}} (see [ResourceManager:993|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L993]) which will trigger a reconnect on the {{JobMaster}}'s side (see [JobMaster::disconnectResourceManager|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L789]) if it's still around because the {{resourceManagerAddress}} (used in {{isConnectingToResourceManager}}) is not cleared. This would only happen during a RM leader change.
> The {{disconnectResourceManager}} will be ignored if the {{JobMaster}} is gone already.
> We should add a guard in some way to {{JobMaster}} to avoid reconnecting to other components during shutdown. This might not only include the ResourceManager connection but might also affect other parts of the {{JobMaster}} API.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)