Posted to dev@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/06/15 08:51:00 UTC

[jira] [Created] (FLINK-18293) TaskExecutor offering non empty slots can lead to resource violation

Till Rohrmann created FLINK-18293:
-------------------------------------

             Summary: TaskExecutor offering non empty slots can lead to resource violation
                 Key: FLINK-18293
                 URL: https://issues.apache.org/jira/browse/FLINK-18293
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.10.1, 1.11.0
            Reporter: Till Rohrmann
             Fix For: 1.12.0


When a {{JobMaster}} loses leadership, the {{TaskExecutor}} fails all running tasks belonging to this job and transitions all slots belonging to this job from {{ACTIVE}} to {{ALLOCATED}}. The idea is that these slots can be re-offered to the new leader of the very same job.

A problem arises when the {{Task}} cancellation takes longer than the election of the new leader. In this case, a slot that still contains a {{CANCELLING}} task will be offered to the new {{JobMaster}} as if it were empty. The {{JobMaster}}, not knowing that the slot still contains a resource consumer, might deploy new tasks into it, believing that these tasks can use all of the slot's resources. In the best case, the newly deployed {{Tasks}} will simply get fewer resources than expected. In the worst case, this will lead to a resource violation.

Without the {{JobMaster}} being able to reconcile the state of {{Tasks}} already deployed into {{Slots}}, I believe that we should only re-offer a slot when it is free. One might model this scenario by introducing a new {{TaskSlotState.CLEANING}}. {{CLEANING}} means that the slot is still allocated for a given job but that some resources still need to be cleaned up before it can be re-offered (at which point it transitions to {{ALLOCATED}}).
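
The proposed state could be sketched roughly as follows. This is only an illustration of the idea, not Flink's actual {{TaskSlot}} code: the class {{SlotSketch}}, its method names, and the use of plain strings as task IDs are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a slot with a CLEANING state; not Flink's real TaskSlot. */
class SlotSketch {

    /** Simplified states; CLEANING is the state proposed in this ticket. */
    enum State { ACTIVE, CLEANING, ALLOCATED }

    private State state = State.ACTIVE;

    // Tasks that have not yet reached a terminal state (this includes CANCELLING tasks).
    private final Set<String> tasks = new HashSet<>();

    /** Deploys a task into the slot; only allowed while the slot is ACTIVE. */
    void addTask(String taskId) {
        if (state != State.ACTIVE) {
            throw new IllegalStateException("Slot is not ACTIVE: " + state);
        }
        tasks.add(taskId);
    }

    /** The JobMaster lost leadership; task cancellation is assumed to be triggered elsewhere. */
    void onLeadershipLost() {
        // Go to ALLOCATED only if nothing is left in the slot, otherwise wait in CLEANING.
        state = tasks.isEmpty() ? State.ALLOCATED : State.CLEANING;
    }

    /** A task reached a terminal state and released its resources. */
    void onTaskTerminated(String taskId) {
        tasks.remove(taskId);
        if (state == State.CLEANING && tasks.isEmpty()) {
            state = State.ALLOCATED;
        }
    }

    /** Only slots in ALLOCATED are offered to the new leader; in this sketch they are always empty. */
    boolean canBeOffered() {
        return state == State.ALLOCATED;
    }

    State getState() {
        return state;
    }
}
```

With this shape, a slot whose task is still {{CANCELLING}} when the new leader is elected stays in {{CLEANING}} and is skipped by the slot offer; it becomes offerable only once the last task has terminated.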



--
This message was sent by Atlassian Jira
(v8.3.4#803005)