Posted to dev@flink.apache.org by "Zhenqiu Huang (JIRA)" <ji...@apache.org> on 2018/11/13 17:16:00 UTC
[jira] [Created] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement
Zhenqiu Huang created FLINK-10868:
-------------------------------------
Summary: Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement
Key: FLINK-10868
URL: https://issues.apache.org/jira/browse/FLINK-10868
Project: Flink
Issue Type: Bug
Components: YARN
Affects Versions: 1.6.2, 1.7.0
Reporter: Zhenqiu Huang
Assignee: Zhenqiu Huang
Currently, YarnResourceManager does not use yarn.maximum-failed-containers as a limit on resource acquirement. In the worst case, when newly started containers consistently fail, YarnResourceManager goes into an infinite resource acquirement loop without ever failing the job. Together with https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all resources of the YARN queue.
In production, we observed the following: a task manager failed in an HA-enabled Flink job. At the same time, there was an HDFS failover, and during that period HDFS rejected requests with "Operation category READ is not supported in state standby". As a result, newly acquired task managers kept failing, and the ResourceManager kept requesting replacements indefinitely.
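The fix the issue asks for amounts to tracking how many containers have failed and failing the job once a configured maximum is exceeded, instead of requesting replacements forever. Below is a minimal, self-contained sketch of that bookkeeping; the class name FailedContainerTracker and its methods are hypothetical illustrations, not Flink's actual YarnResourceManager API.

```java
// Hypothetical sketch of the missing check: count container failures and
// signal that the job should fail once the budget configured by
// yarn.maximum-failed-containers is exhausted.
public class FailedContainerTracker {

    private final int maxFailedContainers; // value of yarn.maximum-failed-containers
    private int failedContainers = 0;

    public FailedContainerTracker(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /** Called whenever YARN reports a container that exited abnormally. */
    public void recordFailure() {
        failedContainers++;
    }

    /**
     * True once more containers have failed than the configured maximum.
     * At that point the ResourceManager should fail the job rather than
     * request yet another replacement container.
     */
    public boolean shouldFailJob() {
        return failedContainers > maxFailedContainers;
    }

    public int getFailedContainers() {
        return failedContainers;
    }
}
```

With a limit of 2, the third recorded failure would trip the check, breaking the infinite acquire-fail cycle described above.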
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)