Posted to issues@flink.apache.org by "Chenyu Zheng (Jira)" <ji...@apache.org> on 2022/04/22 04:43:00 UTC

[jira] [Created] (FLINK-27350) JobManager doesn't bring up new TaskManager during failure recovery

Chenyu Zheng created FLINK-27350:
------------------------------------

             Summary: JobManager doesn't bring up new TaskManager during failure recovery
                 Key: FLINK-27350
                 URL: https://issues.apache.org/jira/browse/FLINK-27350
             Project: Flink
          Issue Type: Bug
            Reporter: Chenyu Zheng
         Attachments: jobmanager.log, stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10.log

I ran into a strange bug during failure recovery in Flink: the JobManager does not seem to bring up new TaskManagers after a failure. Details of the Flink job are below, and logs are attached. Could you take a look and give me some guidance? Thank you so much!

 

Flink version: 1.13.2

Deploy mode: K8s native

Timeline of the bug:
 # The Flink job started running with 8 TaskManagers.
 # At *2022-04-17 00:28:15,286*, the job hit an error and the JobManager decided to restart 2 tasks (pods stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1 and stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
 # The two old pods were stopped and the JobManager created 2 new pods (stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17 00:33:15,376*.
 # The JobManager discarded the two new pods’ registrations at *2022-04-17 00:33:32,393*.
 # The new pods exited at *2022-04-17 00:33:32,396* because their registration was rejected.
 # The JobManager did not bring up any further pods and printed the error “Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout” over and over.
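
To make the timing in the timeline easier to read, the gaps between the timestamps can be computed with a short script. This is only a sketch for reading the timeline: the event labels are shorthand for the steps above, not actual log messages, and the timestamps are copied from the timeline as-is.

```python
from datetime import datetime

# Log timestamps use a comma before the milliseconds.
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Labels are shorthand for the timeline steps (not real log lines).
EVENTS = [
    ("restart of 2 tasks decided", "2022-04-17 00:28:15,286"),
    ("2 new pods created", "2022-04-17 00:33:15,376"),
    ("new pods' registrations discarded", "2022-04-17 00:33:32,393"),
    ("new pods exited", "2022-04-17 00:33:32,396"),
]


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    parsed = [(label, datetime.strptime(ts, FMT)) for label, ts in events]
    return [
        (a_label, b_label, (b_time - a_time).total_seconds())
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


for start, end, seconds in gaps(EVENTS):
    print(f"{start} -> {end}: {seconds:.3f} s")
```

By this calculation, about 300 seconds (5 minutes) passed between the restart decision and the creation of the new pods, about 17 seconds between pod creation and the discarded registrations, and about 3 milliseconds between the discarded registrations and the pods exiting.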


