Posted to yarn-issues@hadoop.apache.org by "Rohith Sharma K S (JIRA)" <ji...@apache.org> on 2016/03/22 11:33:25 UTC

[jira] [Comment Edited] (YARN-4852) Resource Manager Ran Out of Memory

    [ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206113#comment-15206113 ] 

Rohith Sharma K S edited comment on YARN-4852 at 3/22/16 10:32 AM:
-------------------------------------------------------------------

[~slukog] Can you give more information to help verify why there was a sudden spike?
# Did any NM get restarted? If so, how many, and how many containers were running on each NM?
# Was the RM heavily loaded, or was there a deadlock in the scheduler that left most node heartbeats unprocessed?
# Do you have a jstack report for the RM taken while memory was increasing?

These container statuses are cleared from nodeUpdateQueue when the node heartbeat is processed by the scheduler. If the scheduler is stuck or slow, node status events pile up. That would grow nodeUpdateQueue and could cause the OOM.
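The effect described above can be sketched with a simplified model. This is illustrative code only, not the actual RMNodeImpl source: NM heartbeats append container statuses to an unbounded queue, and only the scheduler's node-update handling drains it, so a stalled scheduler lets the queue grow without limit.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified model (NOT the real Hadoop code) of how RMNodeImpl buffers
// container statuses between NM heartbeats and scheduler pulls.
class NodeUpdateModel {
    // In YARN the queue holds UpdatedContainerInfo objects; a plain String
    // stands in for that class here.
    private final ConcurrentLinkedQueue<String> nodeUpdateQueue =
            new ConcurrentLinkedQueue<>();

    // NM heartbeat path: container statuses are appended unconditionally,
    // with no bound on the queue size.
    void onNodeHeartbeat(String containerStatus) {
        nodeUpdateQueue.add(containerStatus);
    }

    // Scheduler NODE_UPDATE path: drains the queue. If the scheduler thread
    // is blocked or slow, this is never called and the queue keeps growing.
    int pullContainerUpdates() {
        int drained = 0;
        while (nodeUpdateQueue.poll() != null) {
            drained++;
        }
        return drained;
    }

    int pendingUpdates() {
        return nodeUpdateQueue.size();
    }

    public static void main(String[] args) {
        NodeUpdateModel node = new NodeUpdateModel();
        // Simulate many heartbeats with no scheduler processing in between:
        // every status stays live on the heap until the scheduler drains it.
        for (int i = 0; i < 100_000; i++) {
            node.onNodeHeartbeat("container_" + i);
        }
        System.out.println("pending before drain: " + node.pendingUpdates());
        System.out.println("drained: " + node.pullContainerUpdates());
        System.out.println("pending after drain: " + node.pendingUpdates());
    }
}
```

In this model, every enqueued status is strongly reachable from the node object, which matches the symptom in the heap dump: Full GC cannot reclaim the memory until the scheduler drains the queue.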



> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>
> Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut itself down. 
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% of the memory. Digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn contain around 1.7 million objects each of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, with each type retaining around 1 GB of heap.
> Full GC was triggered multiple times when the RM went OOM, and only 300 MB of heap was released, so all these objects appear to be live.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time the issue occurred. 
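For what it's worth, the counts in the dump are mutually consistent: 0.5 million UpdatedContainerInfo objects spread over 1200 RMNodeImpl instances averages to roughly 417 updates queued per node, and 1.7 million proto objects per type works out to roughly 3-4 container statuses per update. A quick back-of-the-envelope check (plain arithmetic on the numbers reported above, nothing more):

```java
// Sanity-check the heap-dump counts quoted in the issue description.
public class HeapMath {
    public static void main(String[] args) {
        int nodes = 1200;                       // RMNodeImpl instances in the dump
        double updatedContainerInfos = 500_000; // ~0.5 million UpdatedContainerInfo
        double protosPerType = 1_700_000;       // ~1.7 million of each proto type

        double updatesPerNode = updatedContainerInfos / nodes;
        double statusesPerUpdate = protosPerType / updatedContainerInfos;

        System.out.printf("avg UpdatedContainerInfo per RMNodeImpl: %.0f%n", updatesPerNode);
        System.out.printf("avg container statuses per update: %.1f%n", statusesPerUpdate);
    }
}
```

Hundreds of undrained updates per node is far beyond what a healthy scheduler leaves queued, which points back at the node-update processing path rather than at the workload.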



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)