
Posted to issues@flink.apache.org by "Yun Gao (Jira)" <ji...@apache.org> on 2022/04/13 06:28:06 UTC

[jira] [Updated] (FLINK-23216) RM keeps allocating and freeing slots after a TM lost until its heartbeat timeout

     [ https://issues.apache.org/jira/browse/FLINK-23216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Gao updated FLINK-23216:
----------------------------
    Fix Version/s: 1.16.0

> RM keeps allocating and freeing slots after a TM lost until its heartbeat timeout
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23216
>                 URL: https://issues.apache.org/jira/browse/FLINK-23216
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0, 1.13.1, 1.12.4
>            Reporter: Gen Luo
>            Priority: Major
>             Fix For: 1.15.0, 1.16.0
>
>
> In Flink 1.13, we observe that when YARN notifies the ResourceManager (RM) that a TaskManager (TM) is lost, the RM keeps allocating and freeing slots on a new TM. This continues until the JM marks the lost TM as FAILED when its heartbeat timeout is reached. It is easy to reproduce by increasing akka.ask.timeout and heartbeat.timeout, for example to 10 min.
>  
> After tracing, we believe the procedure is as follows:
> 1. When a TM is killed, YARN receives the event first and notifies the RM.
> 2. In Flink 1.13, the RM uses declarative resource management to manage slots. On receiving the notification, it detects a resource shortage and requests a new TM from YARN.
> 3. The RM then asks the new TM to connect to the JM and offer its slots.
> 4. From the JM's point of view, all slots are already fulfilled, because the lost TM is not considered disconnected until the heartbeat timeout is reached. The JM therefore rejects all slot offers.
> 5. The new TM finds that none of its slots serve the JM and disconnects from it.
> 6. The RM detects a resource shortage again and goes back to step 3, asking the new TM to connect and offer slots to the JM, but it does not request another new TM from YARN.
>  
> The original log is lost, but it looked like this:
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Freeing slot xxx.
> ...(repeated several times for different slots)...
> o.a.f.r.r.s.DefaultSlotStatusSyncer - Starting allocation of slot xxx from container_xxx for job xxx.
> ...(repeated several times for different slots)...
>  
> This could be fixed in several ways, such as also notifying the JM when the RM receives a TM-lost notification, or having TMs not offer slots until required. However, all of these approaches have side effects and may need further discussion.
>  
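
The cycle in the description above can be illustrated with a small toy model. This is only a sketch and not Flink code: the class SlotLoopSketch, the JobMasterView bookkeeping, and the notion of discrete "rounds" are all hypothetical names introduced here for illustration. It assumes that the JM rejects every offer while it still counts the lost TM's slots as held, and that the RM retries allocation on the same replacement TM each round.

```java
/** Toy model of the allocate/free cycle. Not Flink code; all names are hypothetical. */
final class SlotLoopSketch {

    /** The JM's bookkeeping: how many slots it needs and how many it believes it holds. */
    static final class JobMasterView {
        private final int requiredSlots;
        private int slotsBelievedHeld;

        JobMasterView(int requiredSlots) {
            this.requiredSlots = requiredSlots;
            // The lost TM's slots still count as held until the heartbeat times out.
            this.slotsBelievedHeld = requiredSlots;
        }

        /** Accepts an offer only if the JM thinks it is short of slots. */
        boolean offerSlot() {
            if (slotsBelievedHeld < requiredSlots) {
                slotsBelievedHeld++;
                return true;
            }
            return false;
        }

        /** Heartbeat timeout: the JM finally drops the lost TM's slots. */
        void markLostTaskManagerFailed(int lostSlots) {
            slotsBelievedHeld -= lostSlots;
        }
    }

    /**
     * Runs allocation rounds until the JM accepts the replacement TM's slots.
     * Returns the number of rounds in which slots were allocated and then freed again.
     */
    static int wastedRounds(int slots, int heartbeatTimeoutRound) {
        JobMasterView jm = new JobMasterView(slots);
        // Steps 1-2: the RM learns of the loss and requests exactly one replacement TM.
        System.out.println("RM: TM lost, requesting one replacement TM");
        int wasted = 0;
        for (int round = 1; ; round++) {
            if (round >= heartbeatTimeoutRound) {
                jm.markLostTaskManagerFailed(slots);
                System.out.println("JM: heartbeat timeout, lost TM marked FAILED");
            }
            // Step 3: the RM asks the replacement TM to offer its slots to the JM.
            System.out.println("RM: starting allocation of " + slots + " slot(s)");
            int accepted = 0;
            for (int i = 0; i < slots; i++) {
                if (jm.offerSlot()) {
                    accepted++;
                }
            }
            if (accepted == slots) {
                System.out.println("JM: accepted all offers in round " + round);
                return wasted;
            }
            // Steps 4-6: offers rejected, slots freed, RM sees a shortage and retries.
            System.out.println("RM: freeing " + slots + " slot(s)");
            wasted++;
        }
    }

    public static void main(String[] args) {
        int wasted = wastedRounds(2, 3);
        System.out.println("Wasted allocate/free rounds: " + wasted);
    }
}
```

In this model the number of wasted rounds grows with the heartbeat timeout, which matches the observation that a larger heartbeat.timeout makes the behavior easier to see.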


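For the reproduction mentioned in the description, a flink-conf.yaml fragment along the following lines should widen the window in which the cycle is visible. The values are an assumption based on the "10 min" example in the description. heartbeat.timeout is given in milliseconds here, as in the 1.13 configuration reference.

```yaml
# Assumed reproduction settings: both timeouts raised to 10 minutes.
akka.ask.timeout: 10 min
heartbeat.timeout: 600000
```

With these settings, killing a TM on YARN should leave the JM unaware of the loss for up to ten minutes.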

--
This message was sent by Atlassian Jira
(v8.20.1#820001)