Posted to issues@flink.apache.org by "Zhenqiu Huang (JIRA)" <ji...@apache.org> on 2019/02/02 17:16:00 UTC

[jira] [Commented] (FLINK-10868) Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as limit of resource acquirement

    [ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759111#comment-16759111 ] 

Zhenqiu Huang commented on FLINK-10868:
---------------------------------------

[~till.rohrmann]

Thanks for reviewing the PR. Following your suggestions, I made these changes:
 # As the feature is generic to both Yarn and Mesos, I added only the maximum failure rate config option to the resource manager config.
 # I put the failure-rate logic into TimestampBasedFailureRater, which implements the FailureRater interface. Since I don't want to mix two changes with different purposes in the same PR, we can make other code such as FailureRateRestartStrategy use it in a separate small PR.
 # For the RM failure rate test, I tried to write it in ResourceManagerTest, but found it hard to mimic the behavior of registerSlotRequest without mocking many components. I would also have to set up the test RM exactly as YarnResourceManagerTest does. Thus, I kept the test cases separate in YarnResourceManagerTest and MesosResourceManagerTest.
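
To illustrate the idea behind item 2, here is a minimal sketch of a timestamp-based failure rater. It is not the code from the PR: the class name SlidingWindowFailureRater, the method names markFailure and exceedsFailureRate, and the explicit nowMillis parameters are assumptions made for illustration. The sketch records each failure's timestamp and reports whether more than a configured number of failures fall inside a sliding time window.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical sketch of a timestamp-based failure rater.
 * All names here are illustrative and not taken from the Flink code base.
 */
final class SlidingWindowFailureRater {

    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    // Timestamps of recorded failures, oldest first.
    private final ArrayDeque<Long> failureTimestamps = new ArrayDeque<>();

    SlidingWindowFailureRater(int maxFailuresPerInterval, long intervalMillis) {
        if (maxFailuresPerInterval < 0 || intervalMillis <= 0) {
            throw new IllegalArgumentException("invalid failure rate settings");
        }
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records one failure (for example, a failed container) at the given time. */
    void markFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
    }

    /** Returns true if more than the allowed number of failures lie inside the window. */
    boolean exceedsFailureRate(long nowMillis) {
        long cutoff = nowMillis - intervalMillis;
        // Drop failures that have aged out of the sliding window.
        while (!failureTimestamps.isEmpty() && failureTimestamps.peekFirst() <= cutoff) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerInterval;
    }
}
```

A resource manager could call markFailure whenever a container fails and check exceedsFailureRate before requesting a replacement, failing the job instead of retrying once the limit is exceeded. Passing the time in explicitly keeps the sketch testable without a real clock.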

> Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as limit of resource acquirement
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: Mesos, YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as a limit on resource acquirement. In the worst case, when newly started containers consistently fail, YarnResourceManager will go into an infinite resource acquirement process without failing the job. Together with https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all resources of the YARN queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)