Posted to issues@flink.apache.org by "john (Jira)" <ji...@apache.org> on 2022/01/27 02:59:00 UTC

[jira] [Updated] (FLINK-25832) When the TaskManager is closed, its associated slot is not set to the released state.

     [ https://issues.apache.org/jira/browse/FLINK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

john updated FLINK-25832:
-------------------------
    Description: 
I deployed a standalone Flink cluster on Kubernetes with scheduler-mode=reactive enabled. When a TaskManager is shut down, I explicitly call the closeTaskManagerConnection method of ResourceManager. However, when AdaptiveScheduler then restarts the job, it calls the cancel method of Execution, and this method does not check whether the associated slot is still alive. Because the TaskManager that owns the slot has already been shut down, the cancel call goes to it anyway and an RpcTimeout is triggered.
I then changed the cancel method of Execution to check whether the slot is alive before cancelling, but repeating the steps above gave the same result: the RpcTimeout is still triggered. My point is that calling closeTaskManagerConnection on the ResourceManager does not affect the state of the slots allocated from that TaskManager. I think this is a bug. We should improve the behavior of cancel so that it completes faster in this situation.
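
The expected behavior can be sketched as follows. This is a minimal, self-contained model and not Flink code: Slot, TaskManagerConnection, CancelResult, and the cancel helper are hypothetical stand-ins, and the sketch only assumes that closing a TaskManager connection should release its slots so that a later cancel can skip the RPC.

```java
import java.util.ArrayList;
import java.util.List;

public class SlotReleaseSketch {

    /** Hypothetical slot states; these are not Flink's actual types. */
    enum SlotState { ALIVE, RELEASED }

    /** Hypothetical stand-in for a slot allocated from one TaskManager. */
    static final class Slot {
        private SlotState state = SlotState.ALIVE;

        boolean isAlive() {
            return state == SlotState.ALIVE;
        }

        void release() {
            state = SlotState.RELEASED;
        }
    }

    /** Hypothetical bookkeeping for one TaskManager connection. */
    static final class TaskManagerConnection {
        private final List<Slot> allocatedSlots = new ArrayList<>();

        Slot allocateSlot() {
            Slot slot = new Slot();
            allocatedSlots.add(slot);
            return slot;
        }

        /** Assumed behavior: closing the connection also releases its slots. */
        void close() {
            for (Slot slot : allocatedSlots) {
                slot.release();
            }
            allocatedSlots.clear();
        }
    }

    /** Outcome of a cancel attempt in this sketch. */
    enum CancelResult { RPC_SENT, SKIPPED_SLOT_NOT_ALIVE }

    /** Skips the (simulated) RPC when the slot is no longer alive. */
    static CancelResult cancel(Slot slot) {
        if (!slot.isAlive()) {
            // The owning TaskManager is gone, so an RPC could only time out.
            return CancelResult.SKIPPED_SLOT_NOT_ALIVE;
        }
        return CancelResult.RPC_SENT;
    }

    public static void main(String[] args) {
        TaskManagerConnection taskManager = new TaskManagerConnection();
        Slot slot = taskManager.allocateSlot();

        System.out.println(cancel(slot)); // prints RPC_SENT

        taskManager.close();
        System.out.println(cancel(slot)); // prints SKIPPED_SLOT_NOT_ALIVE
    }
}
```

In this model the alive check in cancel only helps because close() updates the slot state. If the state is never updated, which is what the report describes, the check passes and the call is still sent to the closed TaskManager.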

!image-2022-01-27-10-55-14-758.png!!image-2022-01-27-10-55-59-119.png!

!image-2022-01-27-10-57-26-223.png!


> When the TaskManager is closed, its associated slot is not set to the released state.
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-25832
>                 URL: https://issues.apache.org/jira/browse/FLINK-25832
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.14.2, 1.14.3
>            Reporter: john
>            Priority: Major
>         Attachments: image-2022-01-27-10-55-14-758.png, image-2022-01-27-10-55-59-119.png, image-2022-01-27-10-57-26-223.png
>
>
> I deployed a standalone Flink cluster on Kubernetes with scheduler-mode=reactive enabled. When a TaskManager is shut down, I explicitly call the closeTaskManagerConnection method of ResourceManager. However, when AdaptiveScheduler then restarts the job, it calls the cancel method of Execution, and this method does not check whether the associated slot is still alive. Because the TaskManager that owns the slot has already been shut down, the cancel call goes to it anyway and an RpcTimeout is triggered.
> I then changed the cancel method of Execution to check whether the slot is alive before cancelling, but repeating the steps above gave the same result: the RpcTimeout is still triggered. My point is that calling closeTaskManagerConnection on the ResourceManager does not affect the state of the slots allocated from that TaskManager. I think this is a bug. We should improve the behavior of cancel so that it completes faster in this situation.
> !image-2022-01-27-10-55-14-758.png!!image-2022-01-27-10-55-59-119.png!
> !image-2022-01-27-10-57-26-223.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)