Posted to issues@flink.apache.org by "Zhenqiu Huang (JIRA)" <ji...@apache.org> on 2019/01/15 05:47:00 UTC

[jira] [Commented] (FLINK-10868) Flink's JobCluster ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

    [ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742763#comment-16742763 ] 

Zhenqiu Huang commented on FLINK-10868:
---------------------------------------

[~till.rohrmann]

When I tested the PR in production, I found that we need to reject not only all pending requests but also any new slot requests from slot pools. The only issue is that when a user uses the default fixed-delay restart strategy, which retries forever, the ResourceManager will keep rejecting new slot requests. Do you think this is the expected behavior?
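
To make the behavior in question concrete, here is a minimal sketch. It is not the code from the PR and not Flink's actual YarnResourceManager: the class FailedContainerLimiter and all of its methods are hypothetical names, and treating "failures >= limit" as the cutoff is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch only; names and cutoff semantics are assumptions. */
final class FailedContainerLimiter {
    // Stands in for the value configured via yarn.maximum-failed-containers.
    private final int maxFailedContainers;
    private int failedContainers;
    // Slot requests that are still waiting for a container.
    private final Queue<String> pendingRequests = new ArrayDeque<>();

    FailedContainerLimiter(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /** Assumption: reaching the configured number counts as hitting the limit. */
    boolean limitReached() {
        return failedContainers >= maxFailedContainers;
    }

    /** Accepts a new slot request, or rejects it once the limit is reached. */
    boolean requestSlot(String requestId) {
        if (limitReached()) {
            return false; // new requests are rejected as well
        }
        pendingRequests.add(requestId);
        return true;
    }

    /** Records one container failure and returns the pending requests rejected by it. */
    List<String> onContainerFailed() {
        failedContainers++;
        if (!limitReached()) {
            return Collections.emptyList();
        }
        List<String> rejected = new ArrayList<>(pendingRequests);
        pendingRequests.clear();
        return rejected;
    }
}
```

In this sketch, once the limit is hit every later requestSlot call returns false, so a restart strategy that retries forever would keep getting rejections indefinitely, which is the situation I am asking about.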

> Flink's JobCluster ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as the limit for resource acquisition. In the worst case, when newly started containers consistently fail, YarnResourceManager goes into an infinite resource acquisition loop without failing the job. Together with https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all resources of the YARN queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)