Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2022/07/13 22:38:00 UTC

[jira] [Updated] (FLINK-26773) ResourceManager leader election can cause a reconnect while shutting down the JobMaster

     [ https://issues.apache.org/jira/browse/FLINK-26773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-26773:
-----------------------------------
    Labels: pull-request-available stale-assigned  (was: pull-request-available)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue is assigned but has not received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a comment updating the community on your progress.  If this issue is waiting on feedback, please consider this a reminder to the committer/reviewer. Flink is a very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone else may work on it.


> ResourceManager leader election can cause a reconnect while shutting down the JobMaster
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-26773
>                 URL: https://issues.apache.org/jira/browse/FLINK-26773
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.4, 1.15.0, 1.16.0
>            Reporter: Matthias Pohl
>            Assignee: Jonathan Lazarus
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>         Attachments: FLINK-26773.failure-during-shutdown.log
>
>
> There's a race condition involving the {{ResourceManager}} leader election in the {{JobMaster}} during shutdown. While shutting down, the {{JobMaster}} calls {{dissolveResourceManagerConnection}} to disconnect itself from the {{ResourceManager}} (see [JobMaster:1180|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1180]).
> This closes the connection to the {{JobMaster}} from the {{ResourceManager}}'s side (see [ResourceManager:979|https://github.com/apache/flink/blob/9055279d0286f4374694325250a45dc1c60301a7/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L979]). The {{JobMaster}} tries to reconnect to the {{ResourceManager}} leader if there's still an address stored for that leader (which is the case during shutdown; see [JobMaster:790|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L790]).
> The {{JobMaster}} shouldn't try to reconnect after it has already freed its requirements as part of the shutdown.
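
The sequence described above can be illustrated with a minimal, self-contained sketch. It is not Flink's actual code or the fix proposed in the pull request: the class {{JobMasterSketch}}, its fields, and the simplified method signatures are hypothetical, and the shutdown flag is only one assumed way to suppress the reconnect.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for the JobMaster's ResourceManager handling. */
class JobMasterSketch {
    private String resourceManagerAddress; // last known RM leader address, or null
    private boolean connected;
    private boolean shuttingDown;
    final List<String> events = new ArrayList<>(); // records what happened, for inspection

    /** Leader election announces a (new) ResourceManager leader. */
    void notifyOfNewResourceManagerLeader(String address) {
        resourceManagerAddress = address;
        reconnectToResourceManager();
    }

    /** Called when the ResourceManager closes the connection from its side. */
    void disconnectResourceManager() {
        connected = false;
        events.add("disconnected");
        // Assumed guard: without the shuttingDown check, the stored leader
        // address would trigger a reconnect in the middle of the shutdown.
        if (!shuttingDown && resourceManagerAddress != null) {
            reconnectToResourceManager();
        }
    }

    /** Shutdown path: the disconnect is answered by the RM closing the connection. */
    void shutDown() {
        shuttingDown = true;
        events.add("shutdown");
        // Simplification: the RM's callback is modeled as a direct call.
        disconnectResourceManager();
    }

    private void reconnectToResourceManager() {
        connected = true;
        events.add("connect:" + resourceManagerAddress);
    }

    boolean isConnected() {
        return connected;
    }
}
```

With the guard in place, calling {{shutDown()}} after a leader has been announced leaves the sketch disconnected. Removing the {{!shuttingDown}} check reproduces the unwanted reconnect, because the leader address is still stored at that point.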



--
This message was sent by Atlassian Jira
(v8.20.10#820010)