Posted to issues@flink.apache.org by "Zhenqiu Huang (JIRA)" <ji...@apache.org> on 2018/11/20 05:53:00 UTC

[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

    [ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692708#comment-16692708 ] 

Zhenqiu Huang commented on FLINK-10868:
---------------------------------------

[~till.rohrmann]

I am working on a fix in FlinkYarnResourceManager. In per-job cluster mode, since the MiniDispatcher shuts itself down once its only job stops, it should be easy to stop the cluster by killing the only JobMaster registered in the RM through the JobMasterGateway. In session mode, however, all I can do is stop each registered JobMaster once the number of failed containers exceeds the threshold set in the configuration. Do you have any suggestions for stopping a session cluster gracefully?

> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as a limit on resource acquisition. In the worst case, when newly started containers consistently fail, YarnResourceManager goes into an infinite resource acquisition loop without failing the job. Together with https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all resources of the YARN queue.
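
The limit described in the issue can be sketched roughly as follows. This is only an illustration of the idea, not the actual patch: the class FailedContainerLimiter and its methods are hypothetical names that do not exist in Flink, and the threshold is assumed to have been read from yarn.maximum-failed-containers elsewhere.

```java
/**
 * Hypothetical helper (not part of Flink): counts failed containers and
 * reports whether a replacement container may still be requested.
 */
final class FailedContainerLimiter {

    // Assumed to hold the value configured under yarn.maximum-failed-containers.
    private final int maxFailedContainers;

    // Number of container failures observed so far.
    private int failedContainers;

    FailedContainerLimiter(int maxFailedContainers) {
        if (maxFailedContainers < 0) {
            throw new IllegalArgumentException("maxFailedContainers must be >= 0");
        }
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Records one container failure.
     *
     * @return true if the failure count is still within the limit, so a
     *         replacement may be requested; false once the limit is exceeded,
     *         at which point the caller should fail the job(s) instead of
     *         requesting yet another container.
     */
    boolean onContainerFailed() {
        failedContainers++;
        return failedContainers <= maxFailedContainers;
    }

    int getFailedContainers() {
        return failedContainers;
    }
}
```

In this sketch the resource manager would call onContainerFailed() from its container-completed callback and request a new container only while it returns true. What should happen once it returns false (failing the single job in per-job mode, or stopping every registered JobMaster in session mode) is the open question in the comment above.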


