Posted to yarn-issues@hadoop.apache.org by "Rohith Sharma K S (JIRA)" <ji...@apache.org> on 2016/06/21 10:42:58 UTC

[jira] [Commented] (YARN-4862) Handle duplicate completed containers in RMNodeImpl

    [ https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341538#comment-15341538 ] 

Rohith Sharma K S commented on YARN-4862:
-----------------------------------------

Hi [~jianhe], apologies for the long delay!
  In the normal flow, the NM informs the RM that a container has finished, the RM in turn waits for the AM to pull the finished containers, and once the AM has pulled them the RM tells the NM to remove them from the NMContext.

In the preemption flow:
# The RM preempts the containers, which first sends KillContainer to RMContainerImpl.
# KillContainer#transition informs RMNodeImpl to clean up the containers and also informs RMAppAttemptImpl to add them to JustFinishedContainers, so that the AM pulls the finished containers on its next heartbeat. It is assumed that containersToCleanUp will be sent to the NM first and that containersToBeRemovedFromNM will be sent on a later NM heartbeat.

I see a *potential container leak in the NodeManager module* in the preemption flow. There can be a situation where {{containersToCleanUp}} and {{containersToBeRemovedFromNM}} go out together in the same heartbeat. If the same containerId is sent to the NM in both lists at once, the container will never be removed from the NMContext.
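
To make the race concrete, here is a toy Java sketch. It is not the actual NodeManager code: {{ToyNodeManager}}, {{onHeartbeatResponse}} and the string container IDs are hypothetical names, and the sketch assumes the NM only honours a removal request for a container that has already completed and does not remember requests it could not honour.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeartbeatLeakSketch {
    enum State { RUNNING, DONE }

    // Toy stand-in for the NM context: containerId -> state.
    static final class ToyNodeManager {
        final Map<String, State> context = new HashMap<>();

        void launch(String id) {
            context.put(id, State.RUNNING);
        }

        // Handles one heartbeat response that may carry both lists.
        void onHeartbeatResponse(List<String> toBeRemoved, List<String> toCleanUp) {
            // ASSUMPTION: removal is honoured only for completed containers,
            // and a request that cannot be honoured is simply forgotten.
            for (String id : toBeRemoved) {
                if (context.get(id) == State.DONE) {
                    context.remove(id);
                }
            }
            // The kill requested by the cleanup list completes afterwards.
            for (String id : toCleanUp) {
                context.computeIfPresent(id, (k, v) -> State.DONE);
            }
        }
    }

    // Both lists carry the same container in a single heartbeat.
    static boolean leaksWhenSentTogether() {
        ToyNodeManager nm = new ToyNodeManager();
        nm.launch("container_1");
        List<String> ids = Arrays.asList("container_1");
        nm.onHeartbeatResponse(ids, ids);
        return nm.context.containsKey("container_1");
    }

    // Cleanup goes first, removal follows on a later heartbeat.
    static boolean leaksWhenSentSeparately() {
        ToyNodeManager nm = new ToyNodeManager();
        nm.launch("container_1");
        List<String> ids = Arrays.asList("container_1");
        nm.onHeartbeatResponse(Collections.<String>emptyList(), ids);
        nm.onHeartbeatResponse(ids, Collections.<String>emptyList());
        return nm.context.containsKey("container_1");
    }

    public static void main(String[] args) {
        System.out.println("same heartbeat leaks: " + leaksWhenSentTogether());
        System.out.println("separate heartbeats leak: " + leaksWhenSentSeparately());
    }
}
```

Under these assumptions the container stays in the toy context only when both lists arrive together, which is the ordering dependency described above.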

CC [~jlowe]. Basically, I feel this is a bug on the RM side: the RM should inform the RMNode whenever finished containers are received from the NM and rmContainer is null.


As for this JIRA, I think the current patch approach should be fine if we fix the above-mentioned issue. Thoughts?

> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
>                 Key: YARN-4862
>                 URL: https://issues.apache.org/jira/browse/YARN-4862
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689] from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl before creating UpdatedContainerInfo.
> Otherwise, in a heavily loaded cluster where event processing gradually slows down, if any duplicated containers are sent to the RM (which may also be a bug in the NM), the impact is significant because RMNodeImpl always creates UpdatedContainerInfo for the duplicated containers. This increases heap memory usage and causes problems like YARN-4852.
> This is an optimization for issues of the kind seen in YARN-4852.
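
The safeguard described in the issue summary could look roughly like the Java sketch below. It is only an illustration, not the attached patch: {{CompletedContainerDedupSketch}}, {{filterNew}} and {{forget}} are hypothetical names, and container IDs are modelled as plain strings rather than ContainerId objects.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CompletedContainerDedupSketch {
    // IDs of completed containers that have already been passed on once.
    private final Set<String> seenCompleted = new HashSet<>();

    // Returns only the completed containers not reported before, so the
    // caller builds update info for each completed container at most once.
    List<String> filterNew(List<String> reportedCompleted) {
        List<String> fresh = new ArrayList<>();
        for (String id : reportedCompleted) {
            // Set.add returns false when the ID is already present.
            if (seenCompleted.add(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }

    // Drops an ID once it no longer needs tracking, so the set stays bounded.
    // When that point is reached is an assumption left to the caller.
    void forget(String id) {
        seenCompleted.remove(id);
    }
}
```

With such a filter, a status repeated across heartbeats (or within one) is dropped before any per-container object is allocated, which is where the heap growth mentioned above would otherwise come from.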



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
