Posted to issues@flink.apache.org by "zhanglu153 (Jira)" <ji...@apache.org> on 2022/11/19 09:39:00 UTC
[jira] [Created] (FLINK-30095) Flink's JobCluster ResourceManager should throw an exception when the failure number of starting worker reaches the maximum failure rate
zhanglu153 created FLINK-30095:
----------------------------------
Summary: Flink's JobCluster ResourceManager should throw an exception when the failure number of starting worker reaches the maximum failure rate
Key: FLINK-30095
URL: https://issues.apache.org/jira/browse/FLINK-30095
Project: Flink
Issue Type: Improvement
Affects Versions: 1.16.0, 1.15.0, 1.14.0, 1.13.0
Reporter: zhanglu153
As shown in https://issues.apache.org/jira/browse/FLINK-10868, even when resourcemanager.start-worker.max-failure-rate and resourcemanager.start-worker.retry-interval are set, in the worst case, when newly started containers consistently fail, YarnResourceManager goes into an infinite resource acquisition loop without failing the job. Resources on YARN are repeatedly occupied and then released after a period of time, which affects other tasks.
When the number of failed worker starts reaches the maximum failure rate, the Flink JobCluster ResourceManager should throw an exception directly instead of sending a new request to start a worker after a period of time. Currently the job does not fail but stays in the running state indefinitely. Users may not notice in time that the job keeps occupying resources on YARN, which prevents other tasks from obtaining resources there.
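
The proposed behavior could be sketched roughly as follows. This is only an illustration, not Flink's actual ResourceManager code: the class StartWorkerFailureTracker, its constructor parameters, and the recordFailure method are hypothetical names, and the sliding-window count is an assumed way of measuring the failure rate.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch (not a Flink API): counts worker-start failures in a
 * sliding time window and throws once the count reaches the configured limit,
 * instead of silently scheduling another retry.
 */
class StartWorkerFailureTracker {
    private final int maxFailuresPerWindow;
    private final long windowMillis;
    // Timestamps (in milliseconds) of failures still inside the window.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    StartWorkerFailureTracker(int maxFailuresPerWindow, long windowMillis) {
        if (maxFailuresPerWindow <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxFailuresPerWindow = maxFailuresPerWindow;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one failed worker start at the given time.
     *
     * @throws IllegalStateException if the number of failures inside the
     *     window has reached the maximum, so the caller can fail the job.
     */
    void recordFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        // Evict failures that have fallen out of the sliding window.
        while (nowMillis - failureTimestamps.peekFirst() >= windowMillis) {
            failureTimestamps.removeFirst();
        }
        if (failureTimestamps.size() >= maxFailuresPerWindow) {
            throw new IllegalStateException(
                "Worker start failed " + failureTimestamps.size()
                    + " times within " + windowMillis
                    + " ms; giving up instead of retrying");
        }
    }
}
```

In such a design, a per-job ResourceManager would call recordFailure each time a container fails to start and let the exception propagate so that the job fails and its YARN resources are released, rather than waiting for the retry interval and requesting a new worker.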
--
This message was sent by Atlassian Jira
(v8.20.10#820010)