Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2019/11/07 10:32:00 UTC

[jira] [Comment Edited] (FLINK-12342) Yarn Resource Manager Acquires Too Many Containers

    [ https://issues.apache.org/jira/browse/FLINK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969146#comment-16969146 ] 

Till Rohrmann edited comment on FLINK-12342 at 11/7/19 10:31 AM:
-----------------------------------------------------------------

I fear that this issue has not been fully fixed with the latest commits but only mitigated, as we are still blocking the main thread when starting the {{TaskExecutors}}. This could lead to a delay in processing incoming {{YarnResourceManager#onContainersAllocated}} messages. For more details, see FLINK-13184.
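To illustrate the concern (a minimal sketch, assuming hypothetical {{ioExecutor}} and {{mainThreadExecutor}} handles; this is not the actual Flink code), the blocking startup work would have to move off the resource manager's main thread so that incoming {{onContainersAllocated}} callbacks keep being processed:

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Illustrative sketch only; names are assumptions, not Flink's actual API.
class NonBlockingContainerStarter {
    private final Executor ioExecutor;          // pool for blocking work
    private final Executor mainThreadExecutor;  // the RM's single main thread

    NonBlockingContainerStarter(Executor ioExecutor, Executor mainThreadExecutor) {
        this.ioExecutor = ioExecutor;
        this.mainThreadExecutor = mainThreadExecutor;
    }

    void startTaskExecutor(Runnable blockingStartup, Runnable onFailure) {
        // Run the potentially slow startup off the main thread, then hop back
        // onto the main thread to handle any failure (e.g. release the container).
        CompletableFuture
            .runAsync(blockingStartup, ioExecutor)
            .whenCompleteAsync((ignored, failure) -> {
                if (failure != null) {
                    onFailure.run();
                }
            }, mainThreadExecutor);
    }
}
{code}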


was (Author: till.rohrmann):
I fear that this issue has not been fully fixed with the latest commits, as we are still blocking the main thread when starting the {{TaskExecutors}}. This could lead to a delay in processing incoming {{YarnResourceManager#onContainersAllocated}} messages. For more details, see FLINK-13184.

> Yarn Resource Manager Acquires Too Many Containers
> --------------------------------------------------
>
>                 Key: FLINK-12342
>                 URL: https://issues.apache.org/jira/browse/FLINK-12342
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>         Environment: We run jobs in Flink release 1.6.3.
>            Reporter: Zhenqiu Huang
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0, 1.8.3, 1.9.2
>
>         Attachments: Screen Shot 2019-04-29 at 12.06.23 AM.png, container.log, flink-1.4.png, flink-1.6.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In the current implementation of YarnFlinkResourceManager, containers are acquired one by one as requests come in from the SlotManager. This mechanism works when the job is small, say fewer than 32 containers. If the job needs 256 containers, they cannot all be allocated immediately, and the pending requests in the AMRMClient are not removed accordingly. We observed that the AMRMClient asks for the current pending requests + 1 (the new request from the SlotManager) containers. As a result, during the startup of such a job, it asked for 4000+ containers. If an external dependency issue occurs, for example slow HDFS access, the whole job is blocked without getting enough resources and is finally killed with a SlotManager request timeout.
> Thus, we should use the total number of containers already requested, rather than the pending requests in the AMRMClient, as the threshold for deciding whether to add one more resource request.
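>
> As a minimal sketch of that bookkeeping (all names below are illustrative, not the actual {{YarnResourceManager}} code), the decision is based on the total number of containers already asked for instead of the raw pending-request count inside the {{AMRMClient}}:
> {code:java}
> // Illustrative sketch only; field and method names are assumptions,
> // not the actual Flink or YARN API.
> class ContainerRequestBookkeeping {
>     private int numPendingRequests; // filed with AMRMClient, not yet satisfied
>     private int numAllocated;       // containers already granted
>
>     /** Called for each new worker request from the SlotManager. */
>     void onNewWorkerRequested(int numWorkersNeeded) {
>         // Threshold on the total already asked for, not on AMRMClient's
>         // internal pending count, which can lag behind.
>         if (numPendingRequests + numAllocated < numWorkersNeeded) {
>             numPendingRequests++;
>             requestYarnContainer(); // files exactly one new container request
>         }
>     }
>
>     /** Called from onContainersAllocated for each granted container. */
>     void onContainerAllocated() {
>         numPendingRequests--;
>         numAllocated++;
>         removeMatchingContainerRequest(); // drop the satisfied request from AMRMClient
>     }
>
>     private void requestYarnContainer() { /* amRMClient.addContainerRequest(...) */ }
>     private void removeMatchingContainerRequest() { /* amRMClient.removeContainerRequest(...) */ }
> }
> {code}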



--
This message was sent by Atlassian Jira
(v8.3.4#803005)