Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2018/05/03 14:16:00 UTC

[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

    [ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462503#comment-16462503 ] 

Jason Lowe commented on YARN-8242:
----------------------------------

Thanks for the report and the patch!

I'm not a fan of exposing leveldb specifics outside the leveldb NM state store.  It makes it much harder to replace the state store with something else.  Essentially all we need here is an iterable abstraction of the recovery state, and we don't need to expose a leveldb iterator directly to get that.

The state store has a getIterator method, but it's hardcoded to iterate only container state.  That's confusing, since the store holds a lot more than just container state.  Rather than simply expose one iterator, and specifically a leveldb iterator, I think it would be much cleaner to have loadContainerState return an Iterator<RecoveredContainerState> so that callers can iterate through the loaded containers.  The state store can have a helper class that implements the Iterator interface but hides the leveldb details from the caller.  A similar approach can be used for other recovered lists like application state, localized resources, etc. if it's worth it for those as well.
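
To illustrate the shape I have in mind, here is a rough, self-contained sketch.  It is not code from the patch or from the NM: the class names, the "containers/" key prefix, and the TreeMap standing in for the on-disk database are all hypothetical, and the real helper would wrap the leveldb iterator and parse protobufs instead.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// Hypothetical stand-in for the real recovered-container record.
final class RecoveredContainerState {
  final String containerId;
  final String payload;

  RecoveredContainerState(String containerId, String payload) {
    this.containerId = containerId;
    this.payload = payload;
  }
}

// Hypothetical store; a sorted map stands in for the key/value database.
final class SketchStateStore {
  // Assumed key layout, for illustration only.
  private static final String CONTAINERS_PREFIX = "containers/";

  private final TreeMap<String, String> db = new TreeMap<>();

  void put(String key, String value) {
    db.put(key, value);
  }

  // Callers see only java.util.Iterator; the backing cursor stays private.
  Iterator<RecoveredContainerState> loadContainerState() {
    return new ContainerStateIterator(
        db.tailMap(CONTAINERS_PREFIX, true).entrySet().iterator());
  }

  // Decodes one entry at a time, so only one record is held in memory.
  private static final class ContainerStateIterator
      implements Iterator<RecoveredContainerState> {
    private final Iterator<Map.Entry<String, String>> cursor;
    private RecoveredContainerState next;

    ContainerStateIterator(Iterator<Map.Entry<String, String>> cursor) {
      this.cursor = cursor;
      advance();
    }

    // Reads the next entry; stops at the first key outside the prefix.
    private void advance() {
      next = null;
      if (cursor.hasNext()) {
        Map.Entry<String, String> entry = cursor.next();
        if (entry.getKey().startsWith(CONTAINERS_PREFIX)) {
          String id = entry.getKey().substring(CONTAINERS_PREFIX.length());
          next = new RecoveredContainerState(id, entry.getValue());
        }
      }
    }

    @Override
    public boolean hasNext() {
      return next != null;
    }

    @Override
    public RecoveredContainerState next() {
      if (next == null) {
        throw new NoSuchElementException();
      }
      RecoveredContainerState result = next;
      advance();
      return result;
    }
  }
}
```

The point of the sketch is that the caller's loop only touches Iterator<RecoveredContainerState>, so a different backing store could be swapped in without changing the recovery code, and records are materialized one at a time rather than as a full list.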

The null state store should return a valid iterator that has no elements to iterate (e.g., Collections.emptyIterator()) rather than null.  Returning null is going to lead to a lot of NPEs in unit tests or unnecessary null checks in the main code.
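
As a small illustration of the empty-iterator point, under the assumption of a simplified interface (ContainerStateSource, NullContainerStateSource and RecoveryLoop are hypothetical names, and String stands in for the recovered record type):

```java
import java.util.Collections;
import java.util.Iterator;

// Hypothetical, simplified view of the state store API.
interface ContainerStateSource {
  Iterator<String> loadContainerState();
}

// Null implementation: nothing to recover, but never returns null.
final class NullContainerStateSource implements ContainerStateSource {
  @Override
  public Iterator<String> loadContainerState() {
    return Collections.emptyIterator();
  }
}

final class RecoveryLoop {
  // The caller needs no null check; the loop body simply never runs.
  static int countRecovered(ContainerStateSource source) {
    int recovered = 0;
    Iterator<String> it = source.loadContainerState();
    while (it.hasNext()) {
      it.next();
      recovered++;
    }
    return recovered;
  }
}
```

With this, the recovery loop is identical whether recovery is enabled or not.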

A significant number of the changes in the patch are whitespace reformatting unrelated to the purpose of the patch and should be removed.  In addition, the wildcard imports should be removed.


> YARN NM: OOM error while reading back the state store on recovery
> -----------------------------------------------------------------
>
>                 Key: YARN-8242
>                 URL: https://issues.apache.org/jira/browse/YARN-8242
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.2.0
>            Reporter: Kanwaljeet Sachdev
>            Assignee: Kanwaljeet Sachdev
>            Priority: Blocker
>         Attachments: YARN-8242.001.patch
>
>
> On startup the NM reads its state store and builds a list of the applications in the state store to process. If the number of applications in the state store is large and they have a lot of "state" associated with them, the NM can run out of memory (OOM) and never reach the point where it can start processing the recovery.
> Since it never starts the recovery, there is no way for the NM to get past this point. The heap size has to be changed to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError.<init> (OutOfMemoryError.java:48)
> at com.google.protobuf.ByteString.copyFrom (ByteString.java:192)
> at com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47069)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47014)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47102)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47097)
> at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:41016)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:40942)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41080)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41075)
> at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24517)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24464)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24568)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24563)
> at com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom (YarnServiceProtos.java:24739)
> at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState (NMLeveldbStateStoreService.java:217)
> at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState (NMLeveldbStateStoreService.java:170)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover (ContainerManagerImpl.java:253)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit (ContainerManagerImpl.java:237)
> at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
> at org.apache.hadoop.service.CompositeService.serviceInit (CompositeService.java:107)
> at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit (NodeManager.java:255)
> at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
> at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager (NodeManager.java:474)
> at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main (NodeManager.java:521)
> {code}


