Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/09/07 16:51:00 UTC

[jira] [Commented] (FLINK-9190) YarnResourceManager sometimes does not request new Containers

    [ https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607348#comment-16607348 ] 

ASF GitHub Bot commented on FLINK-9190:
---------------------------------------

tillrohrmann commented on issue #5931: [FLINK-9190][flip6,yarn] Request new container if container completed unexpectedly
URL: https://github.com/apache/flink/pull/5931#issuecomment-419500186
 
 
   Yes, at the moment this could happen. However, the superfluous `TaskManager` should be released after it has been idle for too long. Moreover, I'm currently working on making the `SlotManager` aware of how many outstanding slots it has requested. That way, it should not allocate additional containers in case of a failover of the `ExecutionGraph`.
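
   To illustrate the idea of tracking outstanding slots, here is a minimal sketch. It is not the actual `SlotManager` code; the class `PendingSlotTracker` and all of its methods are hypothetical names made up for this example. It assumes every container offers a fixed number of slots and only asks for new containers when the required slots exceed what is already pending.

```java
/**
 * Hypothetical illustration only; this is NOT Flink's SlotManager.
 * Tracks slots that were requested (via containers) but have not registered yet,
 * so that a repeated slot request after a failover does not allocate extra containers.
 */
final class PendingSlotTracker {
    private final int slotsPerContainer; // assumption: every container offers the same number of slots
    private int pendingSlots;            // slots in requested containers that have not registered yet

    PendingSlotTracker(int slotsPerContainer) {
        if (slotsPerContainer <= 0) {
            throw new IllegalArgumentException("slotsPerContainer must be positive");
        }
        this.slotsPerContainer = slotsPerContainer;
    }

    /**
     * Called with the number of slots the job still needs.
     * Returns how many NEW containers have to be requested.
     */
    int requestSlots(int requiredSlots) {
        int missing = Math.max(0, requiredSlots - pendingSlots);
        // Round up to whole containers.
        int newContainers = (missing + slotsPerContainer - 1) / slotsPerContainer;
        pendingSlots += newContainers * slotsPerContainer;
        return newContainers;
    }

    /** Called when a requested container has started and registered its slots. */
    void onContainerRegistered() {
        pendingSlots = Math.max(0, pendingSlots - slotsPerContainer);
    }

    int pendingSlots() {
        return pendingSlots;
    }
}
```

   With two slots per container, a first `requestSlots(4)` asks for two containers. If the `ExecutionGraph` fails over before they register and `requestSlots(4)` is called again, the result is zero because the four pending slots already cover the demand.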

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6, pull-request-available
>             Fix For: 1.5.0
>
>         Attachments: yarn-logs
>
>
> *Description*
> The {{YarnResourceManager}} does not request new containers if {{TaskManagers}} are killed in rapid succession. After 5 minutes, the job is restarted due to a {{NoResourceAvailableException}} and runs normally afterwards. I suspect that {{TaskManager}} failures go unnoticed if the failure occurs before the {{TaskManager}} registers with the master. Logs are attached; I added additional log statements to {{YarnResourceManager.onContainersCompleted}} and {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed and keep requesting new containers. The job should run as soon as resources are available. 
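
The expected behavior described above can be sketched as follows. This is only an illustration under stated assumptions, not the fix from the pull request and not Flink's `YarnResourceManager`: the class `ContainerTracker` and its methods are hypothetical. The point it shows is that a container is tracked from the moment it is allocated, so a completion that arrives before the `TaskManager` registers still triggers a replacement request.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical illustration only; this is NOT Flink's YarnResourceManager.
 * Containers are tracked from allocation time, not from TaskManager registration,
 * so a container that dies before registering is still noticed and replaced.
 */
final class ContainerTracker {
    private final Set<String> allocatedContainers = new HashSet<>();
    private int pendingRequests; // container requests sent but not yet fulfilled

    /** Ask the cluster for one more container. */
    void requestContainer() {
        pendingRequests++;
    }

    /** The cluster handed us a container; remember it immediately. */
    void onContainerAllocated(String containerId) {
        if (pendingRequests > 0) {
            pendingRequests--;
        }
        allocatedContainers.add(containerId);
    }

    /**
     * The cluster reports that a container finished or was killed.
     * Returns true if the container was known and a replacement was requested.
     */
    boolean onContainerCompleted(String containerId) {
        if (allocatedContainers.remove(containerId)) {
            requestContainer();
            return true;
        }
        return false; // unknown container: nothing to replace
    }

    int pendingRequests() {
        return pendingRequests;
    }
}
```

In this sketch, a completion report for an allocated container always increments the pending request count again, whether or not a `TaskManager` ever registered from that container.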



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)