Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2022/01/31 17:51:00 UTC
[jira] [Closed] (FLINK-25855) DefaultDeclarativeSlotPool rejects offered slots when the job is restarting

     [ https://issues.apache.org/jira/browse/FLINK-25855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann closed FLINK-25855.
---------------------------------
    Fix Version/s: 1.15.0
       Resolution: Fixed

Fixed via fb14d4d9671eb91035d5103fb3ca814e5d02d6b6

> DefaultDeclarativeSlotPool rejects offered slots when the job is restarting
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-25855
>                 URL: https://issues.apache.org/jira/browse/FLINK-25855
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is currently restarting. The problem is that in case of a job restart, the scheduler sets the required resources to zero. Hence, all offered slots will be rejected.
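To make the failure mode concrete, here is a minimal toy model of a pool that only accepts offers up to the declared requirement. It is not Flink code: the class `StrictSlotPoolSketch`, its methods, and the string slot IDs are hypothetical, and the real pool is considerably more involved. The sketch only shows why a requirement of zero leads to every offer being rejected.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Flink code): accepts offered slots only up to the declared requirement. */
class StrictSlotPoolSketch {
    private int requiredSlots;
    private final List<String> acceptedSlots = new ArrayList<>();

    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Returns the subset of offers that was accepted; everything else is rejected. */
    List<String> offerSlots(List<String> offers) {
        List<String> newlyAccepted = new ArrayList<>();
        for (String slot : offers) {
            if (acceptedSlots.size() < requiredSlots) {
                acceptedSlots.add(slot);
                newlyAccepted.add(slot);
            }
        }
        return newlyAccepted;
    }

    public static void main(String[] args) {
        StrictSlotPoolSketch pool = new StrictSlotPoolSketch();
        // Assumption taken from the description: requirements are zero while restarting.
        pool.setRequiredSlots(0);
        List<String> accepted = pool.offerSlots(List.of("slot-1", "slot-2"));
        System.out.println("accepted = " + accepted); // prints: accepted = []
    }
}
```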
> This is a problem for local recovery because rejected slots will be freed by the {{TaskExecutor}}, and thereby all local state will be deleted. Hence, in order to properly support local recovery, we need to handle this situation somehow.
> This problem only affects the {{DefaultScheduler}}, since the {{AdaptiveScheduler}} sets the required resources when transitioning into the {{WaitingForResources}} state.
> I see the following options:
> h4. Accept excess slots
> Accepting excess slots means that the {{DefaultDeclarativeSlotPool}} accepts slots which exceed the currently required set of slots. 
> Advantages: 
> * Easy to implement
> Disadvantages:
> * Offered slots that are not really needed will only be freed after the idle slot timeout. This means that some resources might be left unused for some time.
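The trade-off of this option can be sketched with a toy pool that accepts every offer and only releases unused slots once they have been idle for a configured timeout. Everything here is an assumption made for illustration: `ExcessAcceptingPoolSketch`, its method names, and the explicit `nowMillis` parameter are hypothetical and do not mirror Flink's API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Flink code): accepts all offers, frees unused slots after an idle timeout. */
class ExcessAcceptingPoolSketch {
    private final long idleTimeoutMillis;
    // slot id -> time at which the slot became idle
    private final Map<String, Long> idleSince = new LinkedHashMap<>();

    ExcessAcceptingPoolSketch(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Accepts every offered slot, regardless of the current requirements. */
    void offerSlots(List<String> offers, long nowMillis) {
        for (String slot : offers) {
            idleSince.putIfAbsent(slot, nowMillis);
        }
    }

    /** Marks a slot as in use; returns false if the slot is not idle in this pool. */
    boolean reserve(String slot) {
        return idleSince.remove(slot) != null;
    }

    /** Releases and returns all slots that have been idle for at least the timeout. */
    List<String> releaseIdleSlots(long nowMillis) {
        List<String> released = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = idleSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= idleTimeoutMillis) {
                released.add(entry.getKey());
                it.remove();
            }
        }
        return released;
    }
}
```

In this model an unneeded slot stays blocked for the full timeout, which is the disadvantage listed above.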
> h4. Let DefaultDeclarativeSlotPool accept excess slots only when job is restarting
> Here the idea is to only accept excess slots when the job is currently restarting. This will require that the scheduler tells the {{DefaultDeclarativeSlotPool}} about the restarting state.
> Advantages:
> * We would only accept excess slots for the time of restarting
> Disadvantages:
> * We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}. Moreover, we are introducing additional signals that communicate the restarting state to the pool.
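A rough sketch of this option, under the assumption that the scheduler signals the restarting state through an explicit setter. The class `RestartAwarePoolSketch` and the `setRestarting` method are hypothetical names, not part of Flink.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Flink code): accepts excess slots only while the job is restarting. */
class RestartAwarePoolSketch {
    private int requiredSlots;
    private boolean restarting;
    private final List<String> acceptedSlots = new ArrayList<>();

    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Hypothetical extra signal from the scheduler to the pool. */
    void setRestarting(boolean restarting) {
        this.restarting = restarting;
    }

    /** Accepts the slot if it is required, or unconditionally while restarting. */
    boolean offerSlot(String slot) {
        if (restarting || acceptedSlots.size() < requiredSlots) {
            acceptedSlots.add(slot);
            return true;
        }
        return false;
    }
}
```

The additional `restarting` flag is the extra signal mentioned in the disadvantages: the acceptance rule now depends on scheduler state as well as on the declared requirements.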
> h4. Don't immediately free slots on the TaskExecutor when they are rejected
> Instead of freeing the slot on the {{TaskExecutor}} immediately after it is rejected, we could retry offering it for some time and only free the slot after a timeout.
> Advantages:
> * No changes on the JobMaster side needed.
> Disadvantages:
> * Complication of the slot lifecycle on the {{TaskExecutor}}
> * Unneeded slots are not made available for other jobs as fast as possible
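One way this could look on the {{TaskExecutor}} side is sketched below. The class `DelayedFreeSketch`, its callbacks, and the retry window parameter are assumptions made for illustration; the actual slot lifecycle in Flink has more states than this toy model.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (not Flink code): frees a rejected slot only after a retry window has passed. */
class DelayedFreeSketch {
    private final long retryWindowMillis;
    // slot id -> time of the first rejection in the current retry window
    private final Map<String, Long> firstRejectedAt = new HashMap<>();
    private final Set<String> freedSlots = new HashSet<>();

    DelayedFreeSketch(long retryWindowMillis) {
        this.retryWindowMillis = retryWindowMillis;
    }

    /** Returns true if the slot was freed, false if it is kept so the offer can be retried. */
    boolean onRejected(String slot, long nowMillis) {
        long first = firstRejectedAt.computeIfAbsent(slot, key -> nowMillis);
        if (nowMillis - first >= retryWindowMillis) {
            firstRejectedAt.remove(slot);
            freedSlots.add(slot);
            return true;
        }
        return false;
    }

    /** A later offer succeeded, so the pending retry window is discarded. */
    void onAccepted(String slot) {
        firstRejectedAt.remove(slot);
    }

    boolean isFreed(String slot) {
        return freedSlots.contains(slot);
    }
}
```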
> h4. Don't zero resource requirements during job restart
> Instead of zeroing the resource requirements during a job restart, we could also keep the last known requirements. Once the job is restarted, we could adjust the requirements.
> Advantages:
> * Conceptually easy to do
> Disadvantages:
> * The old requirements are not necessarily the same as the new ones
> * Complicates the logic in the scheduler
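As a minimal sketch of this last option, assume the scheduler holds a single declared slot count and receives callbacks at the start and end of a restart. The class `RequirementsOnRestartSketch` and its callback names are hypothetical and only illustrate the idea.

```java
/** Toy model (not Flink code): keeps the last known requirements during a restart. */
class RequirementsOnRestartSketch {
    private int declaredSlots;

    void declare(int slots) {
        declaredSlots = slots;
    }

    int declared() {
        return declaredSlots;
    }

    /** Restart begins: the requirement is deliberately left unchanged instead of being zeroed. */
    void onRestartBegin() {
        // No-op on purpose: slots offered during the restart still match the old requirement.
    }

    /** Restart finished: adjust to what the restarted job actually needs. */
    void onRestartComplete(int newlyRequiredSlots) {
        declaredSlots = newlyRequiredSlots;
    }
}
```

If the restarted job needs fewer slots than before, the surplus is only corrected in `onRestartComplete`, which reflects the first disadvantage listed above.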



--
This message was sent by Atlassian Jira
(v8.20.1#820001)