Posted to issues@flink.apache.org by "Sihua Zhou (JIRA)" <ji...@apache.org> on 2018/07/02 14:55:00 UTC

[jira] [Closed] (FLINK-9351) RM stop assigning slot to Job because the TM killed before connecting to JM successfully

     [ https://issues.apache.org/jira/browse/FLINK-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sihua Zhou closed FLINK-9351.
-----------------------------
    Resolution: Duplicate

This issue has been fixed as a side effect of the PR for [FLINK-9456|https://issues.apache.org/jira/browse/FLINK-9456].

> RM stop assigning slot to Job because the TM killed before connecting to JM successfully
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-9351
>                 URL: https://issues.apache.org/jira/browse/FLINK-9351
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> The steps are the following (copied from Stephan's comments in [5931|https://github.com/apache/flink/pull/5931]):
> - JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
> - ResourceManager starts a container with a TaskManager
> - TaskManager registers at ResourceManager, which tells the TaskManager to push a slot to the JobManager.
> - TaskManager container is killed
> - The ResourceManager does not queue back the slot requests (AllocationIDs) that it sent to the previous TaskManager, so the requests are lost and need to time out before another attempt is tried.
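
The steps above can be replayed against a small toy model. This is only a sketch: ToyResourceManager, its methods, and the requeueOnTaskManagerLoss flag are hypothetical names made up for illustration and are not Flink's actual ResourceManager or SlotManager API. It assumes one slot per TaskManager and shows one possible remedy (putting in-flight AllocationIDs back into the pending queue when the TaskManager is lost); it does not claim to be the change made for FLINK-9456.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class SlotRequestSketch {

    /** Toy stand-in for the ResourceManager's slot bookkeeping (hypothetical, not Flink code). */
    public static class ToyResourceManager {
        // Assumption: a switch between the reported behaviour (false) and a re-queueing remedy (true).
        private final boolean requeueOnTaskManagerLoss;
        // AllocationIDs still waiting for a TaskManager slot.
        private final Queue<String> pendingRequests = new ArrayDeque<>();
        // AllocationIDs handed to a TaskManager but not yet offered to the JobMaster.
        private final Map<String, List<String>> inFlightByTaskManager = new HashMap<>();

        public ToyResourceManager(boolean requeueOnTaskManagerLoss) {
            this.requeueOnTaskManagerLoss = requeueOnTaskManagerLoss;
        }

        /** Step 1: JobMaster / SlotPool requests a slot. */
        public void requestSlot(String allocationId) {
            pendingRequests.add(allocationId);
        }

        /** Step 3: a TaskManager registers and is told to serve one pending request. */
        public void registerTaskManager(String taskManagerId) {
            List<String> assigned =
                    inFlightByTaskManager.computeIfAbsent(taskManagerId, id -> new ArrayList<>());
            String next = pendingRequests.poll();
            if (next != null) {
                assigned.add(next);
            }
        }

        /** Step 4: the TaskManager container is killed before it reaches the JobMaster. */
        public void taskManagerLost(String taskManagerId) {
            List<String> lost = inFlightByTaskManager.remove(taskManagerId);
            if (lost != null && requeueOnTaskManagerLoss) {
                // Remedy: make the lost requests eligible for the next TaskManager.
                pendingRequests.addAll(lost);
            }
            // Without re-queueing, the AllocationIDs are simply forgotten here (step 5).
        }

        public List<String> pending() {
            return new ArrayList<>(pendingRequests);
        }
    }

    public static void main(String[] args) {
        for (boolean requeue : new boolean[] {false, true}) {
            ToyResourceManager rm = new ToyResourceManager(requeue);
            rm.requestSlot("alloc-1");
            rm.registerTaskManager("tm-1");
            rm.taskManagerLost("tm-1");
            System.out.println("requeue=" + requeue + " pending=" + rm.pending());
        }
    }
}
```

With requeue=false the pending list ends up empty, which corresponds to the lost request that has to time out on the JobMaster side; with requeue=true the list still contains alloc-1, so a newly started TaskManager could serve it immediately.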



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)