Posted to dev@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2022/09/22 14:34:00 UTC

[jira] [Created] (FLINK-29396) Race condition in JobMaster shutdown can leak resource requirements

Chesnay Schepler created FLINK-29396:
----------------------------------------

             Summary: Race condition in JobMaster shutdown can leak resource requirements
                 Key: FLINK-29396
                 URL: https://issues.apache.org/jira/browse/FLINK-29396
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.15.0
            Reporter: Chesnay Schepler


When a JobMaster (JM) is stopped, it
a) sends a message to the ResourceManager (RM), informing it of the final job status
b) removes itself as the leader.

Once the JM loses leadership, the RM is informed of that as well.

As a result, 2 messages are sent to the RM at about the same time, and the outcome depends on which one arrives first.
If the shutdown notification arrives first (and the job is in a terminal state), we wipe the resource requirements, and the leader loss notification is effectively ignored.
If the leader loss notification arrives first, we keep the resource requirements, assuming that another JM will pick the job up later on, and the shutdown notification is ignored.
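
The two orderings above can be illustrated with a small toy model. This is only a sketch and not Flink's actual code: the class ToyResourceManager, the Message enum, the method leakedSlots and the slot count of 4 are all hypothetical names and values made up for this example. It also assumes that whichever message arrives first deregisters the JM, which is what causes the second message to be ignored.

```java
public class ShutdownRaceSketch {

    /** The two notifications the RM receives when a JM stops (hypothetical names). */
    public enum Message { JOB_REACHED_TERMINAL_STATE, LEADER_LOST }

    /** Toy stand-in for the RM's bookkeeping of a single job; not the real ResourceManager. */
    static final class ToyResourceManager {
        private int requiredSlots;
        private boolean jobMasterRegistered = true;

        ToyResourceManager(int requiredSlots) {
            this.requiredSlots = requiredSlots;
        }

        void handle(Message message) {
            if (!jobMasterRegistered) {
                // Assumption: once the JM is deregistered, later messages from it are ignored.
                return;
            }
            jobMasterRegistered = false;
            if (message == Message.JOB_REACHED_TERMINAL_STATE) {
                // The job is done, so its requirements are wiped.
                requiredSlots = 0;
            }
            // LEADER_LOST: requirements are kept, expecting another JM to take over the job.
        }

        int requiredSlots() {
            return requiredSlots;
        }
    }

    /** Returns the requirements still held by the toy RM after both messages were delivered. */
    public static int leakedSlots(Message first, Message second) {
        ToyResourceManager rm = new ToyResourceManager(4);
        rm.handle(first);
        rm.handle(second);
        return rm.requiredSlots();
    }

    public static void main(String[] args) {
        // Shutdown notification wins the race: requirements are wiped.
        System.out.println("shutdown first:    "
                + leakedSlots(Message.JOB_REACHED_TERMINAL_STATE, Message.LEADER_LOST));
        // Leader loss wins the race: requirements stay behind.
        System.out.println("leader loss first: "
                + leakedSlots(Message.LEADER_LOST, Message.JOB_REACHED_TERMINAL_STATE));
    }
}
```

In this model the same pair of messages leaves 0 required slots in one order and 4 in the other, which is the leak described above.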

If the leader loss notification wins the race, the leaked requirements can cause a session cluster to essentially do nothing until the job timeout is triggered because no leader is present (default: 5 minutes).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)