Posted to user@hadoop.apache.org by Nand kishor Bansal <nk...@gmail.com> on 2019/10/23 13:00:43 UTC

OOM in Yarn NodeManager due to several failed applications and application attempts

Hi,

I'm running a 3-node YARN cluster (3 RMs and 3 NMs) to deploy Samza
applications. Due to some temporary disruptions I ended up with several
failed application attempts. Two of the NodeManagers accumulated the state
of all these failed attempts, and on restart they try to clear that state
and send the update to the active ResourceManager. This caused three
problems:

1. The YARN NMs, which were configured with Xmx=512m, couldn't hold all
this state in memory and started going OOM. After some trial and error I
raised Xmx to 4096m to move forward (see the config sketch after the stack
trace below).

2019-10-23 11:12:41,935 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.ipc.RemoteException(java.lang.OutOfMemoryError): Java heap space

        at org.apache.hadoop.ipc.Client.call(Client.java:1504)
        at org.apache.hadoop.ipc.Client.call(Client.java:1441)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
        at com.sun.proxy.$Proxy30.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
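
For reference, the NM heap change was made along these lines (a sketch
assuming a Hadoop 2.x-style yarn-env.sh; the variable name may differ in
other versions or if your heap is configured elsewhere):

    # etc/hadoop/yarn-env.sh
    # NodeManager heap in MB (was effectively Xmx=512m before the change)
    export YARN_NODEMANAGER_HEAPSIZE=4096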


2. After the OOM issue was fixed, the NM failed to send the updated state
to the RM because of the RPC payload limit of 64 MB
(ipc.maximum.data.length=67108864). To move forward I raised this limit to
128 MB (134217728); see the config sketch after the log below.

2019-10-23 10:57:25,539 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8025: readAndProcess from client 10.200.200.73 threw exception [java.io.IOException: Requested data length 107771315 is longer than maximum configured RPC length 67108864.  RPC came from 10.200.200.73]
java.io.IOException: Requested data length 107771315 is longer than maximum configured RPC length 67108864.  RPC came from 10.200.200.73
        at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1657)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1719)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:930)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:786)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:757)
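
The RPC limit change went into core-site.xml. As far as I understand,
ipc.maximum.data.length is enforced by the receiving IPC server, so it
needs to be picked up by the RM (the side throwing the exception above):

    <!-- core-site.xml on the RM hosts -->
    <property>
      <name>ipc.maximum.data.length</name>
      <!-- 128 MB, up from the 64 MB default -->
      <value>134217728</value>
    </property>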


3. Then the active RM started going OOM. To get around that I increased
the RM Xmx from 256m to 1024m (see the sketch after the stack trace below).

2019-10-23 11:08:33,513 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8025, call org.apache.hadoop.yarn.server.api.ResourceTrackerPB.registerNodeManager from 10.200.200.71:58416 Call#0 Retry#0
java.lang.OutOfMemoryError: Java heap space
        at com.google.protobuf.ByteString.copyFrom(ByteString.java:192)
        at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:324)
        at org.apache.hadoop.yarn.proto.YarnServerCommonServiceProtos$NMContainerStatusProto.<init>(YarnServerCommonServiceProtos.java:8630)
        at org.apache.hadoop.yarn.proto.YarnServerCommonServiceProtos$NMContainerStatusProto.<init>(YarnServerCommonServiceProtos.java:8530)
        at org.apache.hadoop.yarn.proto.YarnServerCommonServiceProtos$NMContainerStatusProto$1.parsePartialFrom(YarnServerCommonServiceProtos.java:8673)
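
Similarly, the RM heap was raised along these lines (again a sketch
assuming a Hadoop 2.x-style yarn-env.sh; adjust to however your RM heap is
actually configured):

    # etc/hadoop/yarn-env.sh
    # ResourceManager heap in MB (was 256m before the change)
    export YARN_RESOURCEMANAGER_HEAPSIZE=1024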


For reference, here is the list of applications tracked by YARN in the
ZooKeeper RM state store. Please note that not all of these applications
are active.
[zk: platform1:2181,platform2:2181,platform3:2181(CONNECTED) 0] ls
/rmstore/ZKRMStateRoot/RMAppRoot
[application_1569716692464_0001, application_1560384459488_0001,
application_1562056525389_0001, application_1563217931091_0001,
application_1569110685650_0001, application_1568967214631_0001,
application_1571715969755_0001, application_1568967214631_0002,
application_1567297673974_0001, application_1561421320014_0001,
application_1557630905716_0001, application_1568966433994_0001,
application_1565741380298_0001, application_1566370160076_0001,
application_1568678895608_0001, application_1558843260172_0001,
application_1567988699825_0001, application_1571697882625_0001,
application_1559261057600_0001, application_1571716977251_0001,
application_1562803856474_0001, application_1571716977251_0002,
application_1571828313911_0001, application_1571828313911_0002,
application_1563753756586_0001, application_1564012947273_0001,
application_1568964760196_0001]
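
If it helps with suggestions: as far as I understand, the settings below
cap how many completed applications the RM keeps in memory and in the ZK
state store. The values are only an example, not what this cluster
currently runs:

    <!-- yarn-site.xml (example values only) -->
    <property>
      <name>yarn.resourcemanager.max-completed-applications</name>
      <value>100</value>
    </property>
    <property>
      <name>yarn.resourcemanager.state-store.max-completed-applications</name>
      <value>100</value>
    </property>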

After the above changes, both the RM and the NMs started behaving normally
and I was able to deploy the Samza application successfully.

Are there better ways to recover from this situation?
In a production system I can't give this much heap (4096m for the NM and
1024m for the RM) to the YARN components.
Does this look like an issue with Hadoop YARN?

Thanks,
Nand Kishor Bansal