Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2022/01/27 12:58:00 UTC

[jira] [Created] (FLINK-25855) DefaultDeclarativeSlotPool rejects offered slots when the job is restarting

Till Rohrmann created FLINK-25855:
-------------------------------------

Summary: DefaultDeclarativeSlotPool rejects offered slots when the job is restarting
Key: FLINK-25855
URL: https://issues.apache.org/jira/browse/FLINK-25855
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Affects Versions: 1.14.3, 1.15.0
Reporter: Till Rohrmann

The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is currently restarting. The problem is that in case of a job restart, the scheduler sets the required resources to zero. Hence, all offered slots will be rejected.
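
The rejection behaviour can be illustrated with a minimal sketch. The class {{StrictSlotPool}} and its methods are hypothetical and far simpler than the real {{DefaultDeclarativeSlotPool}}; the sketch only assumes that offers are accepted while the declared requirement is unfulfilled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a declarative slot pool; not Flink's actual class. */
class StrictSlotPool {
    private int requiredSlots;
    private final List<String> acceptedSlots = new ArrayList<>();

    /** Assumed to be called by the scheduler; a restart sets this to zero. */
    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Accepts offers only while the requirement is unfulfilled; returns the rejected slot ids. */
    List<String> offerSlots(List<String> offeredSlotIds) {
        List<String> rejected = new ArrayList<>();
        for (String slotId : offeredSlotIds) {
            if (acceptedSlots.size() < requiredSlots) {
                acceptedSlots.add(slotId);
            } else {
                rejected.add(slotId);
            }
        }
        return rejected;
    }

    int acceptedCount() {
        return acceptedSlots.size();
    }
}
```

With the requirement at zero, every slot offered to this sketch ends up in the rejected list, which corresponds to the situation during a restart.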

This is a problem for local recovery because the {{TaskExecutor}} frees rejected slots and thereby deletes all of their local state. Hence, in order to properly support local recovery, we need to handle this situation. I see the following options:

h3. Accept excess slots
Accepting excess slots means that the {{DefaultDeclarativeSlotPool}} accepts slots which exceed the currently required set of slots.

Advantages:
* Easy to implement

Disadvantages:
* Offered slots that are not really needed will only be freed after the idle slot timeout. This means that some resources might be left unused for some time.
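
A rough sketch of this option, including the idle slot timeout that causes the disadvantage above. {{ExcessAcceptingSlotPool}} and all of its methods are hypothetical names for illustration, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical pool that accepts every offer and relies on an idle timeout for cleanup. */
class ExcessAcceptingSlotPool {
    private final long idleTimeoutMillis;
    // slot id -> timestamp at which the slot was accepted and became idle
    private final Map<String, Long> idleSince = new HashMap<>();

    ExcessAcceptingSlotPool(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Always accepts, even when no slots are currently required. */
    boolean offerSlot(String slotId, long nowMillis) {
        idleSince.put(slotId, nowMillis);
        return true;
    }

    /** Takes an idle slot into use again, e.g. once the restarted job needs it. */
    boolean reserve(String slotId) {
        return idleSince.remove(slotId) != null;
    }

    /** Releases slots that have been idle for at least the timeout; returns how many were released. */
    int releaseIdleSlots(long nowMillis) {
        int released = 0;
        Iterator<Map.Entry<String, Long>> it = idleSince.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= idleTimeoutMillis) {
                it.remove();
                released++;
            }
        }
        return released;
    }

    int idleCount() {
        return idleSince.size();
    }
}
```

In this sketch an unneeded slot stays in the pool until {{releaseIdleSlots}} is called after the timeout has elapsed, so its resources remain unused for that period.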

h3. Let DefaultDeclarativeSlotPool accept excess slots when job is restarting
Here the idea is to accept excess slots only while the job is restarting. This will require that the scheduler tells the {{DefaultDeclarativeSlotPool}} about the restarting state.

Advantages:
* We would only accept excess slots for the time of restarting

Disadvantages:
* We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}. Moreover, we are introducing additional signals that communicate the restarting state to the pool.
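
A minimal sketch of this option, assuming the restarting state is communicated through a setter. {{RestartAwareSlotPool}} and {{setRestarting}} are hypothetical names; no such signal exists in the pool today.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pool that accepts excess slots only while the job is restarting. */
class RestartAwareSlotPool {
    private int requiredSlots;
    private boolean restarting;
    private final List<String> acceptedSlots = new ArrayList<>();

    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Hypothetical additional signal that the scheduler would have to send. */
    void setRestarting(boolean restarting) {
        this.restarting = restarting;
    }

    /** Accepts a slot if it is needed, or if the job is restarting and may need it again soon. */
    boolean offerSlot(String slotId) {
        boolean needed = acceptedSlots.size() < requiredSlots;
        if (needed || restarting) {
            acceptedSlots.add(slotId);
            return true;
        }
        return false;
    }
}
```

The extra {{restarting}} flag is exactly the additional state that complicates the pool's semantics: acceptance no longer depends on the declared requirements alone.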

h3. Don't immediately free slots on the TaskExecutor when they are rejected
Instead of freeing the slot on the {{TaskExecutor}} immediately after it is rejected, we could retry the offer for some time and only free the slot after a timeout.

Advantages:
* No changes on the JobMaster side needed.

Disadvantages:
* Complication of the slot lifecycle on the {{TaskExecutor}}
* Unneeded slots are not made available for other jobs as fast as possible
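
A sketch of how such a retry could look on the {{TaskExecutor}} side. {{RetryingSlotOfferer}}, its {{Outcome}} enum and the {{Predicate}} standing in for the JobMaster are all assumptions made for illustration; the real slot offering protocol is asynchronous and more involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical TaskExecutor-side helper that frees a rejected slot only after a timeout. */
class RetryingSlotOfferer {
    enum Outcome { ACCEPTED, RETRY_LATER, FREED }

    private final long freeAfterMillis;
    // slot id -> timestamp of the first rejection in the current retry window
    private final Map<String, Long> firstRejectedAt = new HashMap<>();

    RetryingSlotOfferer(long freeAfterMillis) {
        this.freeAfterMillis = freeAfterMillis;
    }

    /** Offers the slot once; the predicate stands in for the JobMaster's accept/reject decision. */
    Outcome offer(String slotId, Predicate<String> jobMasterAccepts, long nowMillis) {
        if (jobMasterAccepts.test(slotId)) {
            firstRejectedAt.remove(slotId);
            return Outcome.ACCEPTED;
        }
        long firstRejection = firstRejectedAt.computeIfAbsent(slotId, id -> nowMillis);
        if (nowMillis - firstRejection >= freeAfterMillis) {
            // Only at this point would the slot, and with it the local state, be given up.
            firstRejectedAt.remove(slotId);
            return Outcome.FREED;
        }
        return Outcome.RETRY_LATER;
    }
}
```

The per-slot rejection timestamp is the additional lifecycle state this option introduces, and until the timeout expires the slot is not available to other jobs.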

--
This message was sent by Atlassian Jira
(v8.20.1#820001)