Posted to yarn-issues@hadoop.apache.org by "zhuqi (Jira)" <ji...@apache.org> on 2019/09/29 08:33:00 UTC

[jira] [Reopened] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

     [ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi reopened YARN-2368:
-------------------------

Beyond making yarn.resourcemanager.zk-jutemaxbuffer-bytes configurable, we should also be able to control whether to retry the failed store operation or just log enough application info to identify which application caused the ZooKeeper buffer to overflow. That way we avoid the GC problems that occur when we retry too much and time out the ZooKeeper connection, and we can track down the root application behind the oversized buffer, as sketched below.
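
A minimal sketch of that idea, assuming a hypothetical ZnodeSizeGuard helper: the property name is the one mentioned above, but the class, method names, and default value are illustrative, not the actual YARN-2368 patch.

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper, not part of the actual patch: checks the serialized
// state size before the ZK write, so an oversized application is logged and
// skipped instead of retried until the ZooKeeper session times out.
public class ZnodeSizeGuard {
  private static final Log LOG = LogFactory.getLog(ZnodeSizeGuard.class);

  // Illustrative default: ZooKeeper's jute.maxbuffer default is roughly 1 MB.
  private static final int DEFAULT_MAX_ZNODE_BYTES = 1024 * 1024;

  private final int maxZnodeBytes;

  public ZnodeSizeGuard(Configuration conf) {
    this.maxZnodeBytes = conf.getInt(
        "yarn.resourcemanager.zk-jutemaxbuffer-bytes",
        DEFAULT_MAX_ZNODE_BYTES);
  }

  /** Returns true if the data fits; otherwise logs the offending app. */
  public boolean fits(String appId, byte[] serializedState) {
    if (serializedState.length <= maxZnodeBytes) {
      return true;
    }
    LOG.error("State for " + appId + " is " + serializedState.length
        + " bytes, over the configured limit of " + maxZnodeBytes
        + "; logging and skipping the ZK write instead of retrying");
    return false;
  }
}
{code}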

> ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-2368
>                 URL: https://issues.apache.org/jira/browse/YARN-2368
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>            Reporter: Leitao Guo
>            Assignee: zhuqi
>            Priority: Critical
>         Attachments: YARN-2368.patch
>
>
> Both ResourceManagers threw STATE_STORE_OP_FAILED events and eventually failed. The ZooKeeper log shows that ZKRMStateStore tried to update a znode larger than 1 MB, which is the default 'jute.maxbuffer' limit for both the ZooKeeper server and client (a sketch of raising this limit follows the quoted logs).
> The ResourceManager (IP 10.153.80.8) log shows the following:
> {code}
> 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
> 2014-07-25 22:33:11,078 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
> 2014-07-25 22:33:11,214 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_000001
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Meanwhile, the ZooKeeper log shows the following:
> {code}
> 2014-07-25 22:10:09,728 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating client: 0x247684586e70006
> 2014-07-25 22:10:09,730 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 0x247684586e70006 with negotiated timeout 10000 for client /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth packet /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth success /10.153.80.8:58890
> 2014-07-25 22:10:09,742 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747
> 2014-07-25 22:10:09,743 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.153.80.8:58890 which had sessionid 0x247684586e70006
> ... ...
> 2014-07-25 22:33:10,966 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x247684586e70006 due to java.io.IOException: Len error 1530747
> {code}
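
For reference, here is a hedged sketch of raising the 1 MB limit mentioned in the quoted description. jute.maxbuffer is a Java system property read by both the ZooKeeper server and client, so it must be raised on every JVM involved (the ZooKeeper servers and the ResourceManager); the class name, host, and 4 MB value below are examples only.

{code}
import org.apache.zookeeper.ZooKeeper;

public class RaiseJuteMaxBuffer {
  public static void main(String[] args) throws Exception {
    // Set before any ZooKeeper class reads it; many ZooKeeper versions
    // capture the value once during class initialization.
    System.setProperty("jute.maxbuffer", String.valueOf(4 * 1024 * 1024));

    // Example connect string; the watcher ignores events.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 10000, event -> { });
    // Client-side reads/writes of znodes up to ~4 MB are now accepted;
    // the ZooKeeper server JVMs need -Djute.maxbuffer=4194304 as well.
    zk.close();
  }
}
{code}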


