Posted to issues@flink.apache.org by "Gyula Fora (JIRA)" <ji...@apache.org> on 2018/10/02 08:52:00 UTC

[jira] [Commented] (FLINK-9635) Local recovery scheduling can cause spread out of tasks

    [ https://issues.apache.org/jira/browse/FLINK-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635168#comment-16635168 ] 

Gyula Fora commented on FLINK-9635:
-----------------------------------

Should we consider this issue a blocker? I know a proper fix is very hard and takes a lot of effort, but the current state is very unsafe as well.

> Local recovery scheduling can cause spread out of tasks
> -------------------------------------------------------
>
>                 Key: FLINK-9635
>                 URL: https://issues.apache.org/jira/browse/FLINK-9635
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.7.0
>
>
> In order to make local recovery work, Flink's scheduling was changed so that it tries to reschedule each task to its previous location. In order not to occupy slots that have the state of other tasks cached, the strategy requests a new slot if the old slot, identified by the previous allocation id, is no longer present. This also applies to newly allocated slots, because there is no distinction between new and already used slots. As a result, every task can end up deployed to its own slot, for example if the {{SlotPool}} has released all slots in the meantime. The consequence could be that a job can no longer be executed after a failure because it needs more slots than before.
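
The effect described above can be illustrated with a toy model. This is only a sketch of the behaviour as the description states it, not Flink's scheduler code: the `reschedule` function, the task names, and the allocation ids are all hypothetical, and slot sharing is reduced to "tasks with the same allocation id share a slot".

```python
from itertools import count


def reschedule(previous_allocation, live_allocation_ids):
    """Toy model of the recovery strategy (hypothetical, not Flink's API).

    previous_allocation maps task name -> allocation id it ran in before.
    live_allocation_ids is the set of allocation ids still in the pool.
    """
    fresh_ids = count()
    placement = {}
    for task, prev_id in previous_allocation.items():
        if prev_id in live_allocation_ids:
            # The previous slot still exists: go back to it (local state).
            placement[task] = prev_id
        else:
            # The previous slot is gone. Any other slot, including one just
            # allocated for a sibling task, might hold another task's state,
            # so the task asks for a brand-new slot of its own.
            placement[task] = f"new-{next(fresh_ids)}"
    return placement


# Assumed setup: two slots, each shared by one source and one sink subtask.
previous = {"source-0": "a", "sink-0": "a", "source-1": "b", "sink-1": "b"}

healthy = reschedule(previous, {"a", "b"})   # slots survived the failure
after_release = reschedule(previous, set())  # pool released every slot

slots_before = len(set(previous.values()))
slots_healthy = len(set(healthy.values()))
slots_after = len(set(after_release.values()))
print(slots_before, slots_healthy, slots_after)
```

In this model the job fits in two slots before the failure and when its slots survive, but asks for four once the pool is empty, which is the "needs more slots than before" situation from the description.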



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)