Posted to yarn-issues@hadoop.apache.org by "Dustin Cote (JIRA)" <ji...@apache.org> on 2016/01/06 15:47:39 UTC

[jira] [Updated] (YARN-3934) Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used

     [ https://issues.apache.org/jira/browse/YARN-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dustin Cote updated YARN-3934:
------------------------------
    Attachment: YARN-3934-1.patch

Here's a first attempt at the fix.  We cannot know with certainty what jute.maxbuffer is set to on the ZK server side, so we have to assume that it matches the value on the client side (in this case the RM).  I've set up the code to read the property as a system property, which is how we normally specify it.  There may be a desire to standardize it into the YARN config later on, but I think that's outside the scope of this fix.  Without the patch, the ZK connection is broken and retried, by default *1000* times, so the RM doesn't go down for a while, and in the meantime all applications are blocked from submission.  I think it's probably worth revisiting that default value as well, but I'd like feedback from reviewers on whether we should open a separate JIRA for that.
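
For reviewers who want the idea without opening the patch, here is a minimal sketch of reading the client-side limit. It is not the code in YARN-3934-1.patch. The class and method names are hypothetical, and the fallback of 0xfffff bytes is ZooKeeper's default for jute.maxbuffer, assumed here to match the server as well.

```java
public class JuteMaxBufferSketch {
    // ZooKeeper's default for jute.maxbuffer when the property is unset
    // (0xfffff bytes, just under 1 MB). Assumed to match the server side.
    static final int DEFAULT_JUTE_MAX_BUFFER = 0xfffff;

    /** Hypothetical helper: reads the limit from the RM's own system properties. */
    static int clientMaxBuffer() {
        // Integer.getInteger returns the fallback if the property is
        // missing or is not a valid integer.
        return Integer.getInteger("jute.maxbuffer", DEFAULT_JUTE_MAX_BUFFER);
    }

    public static void main(String[] args) {
        System.out.println("Assumed ZK buffer limit: " + clientMaxBuffer() + " bytes");
    }
}
```

Running the RM JVM with -Djute.maxbuffer=<bytes> changes the value returned here. If the ZK servers use a different setting, the assumption above no longer holds.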

> Application with large ApplicationSubmissionContext can cause RM to exit when ZK store is used
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-3934
>                 URL: https://issues.apache.org/jira/browse/YARN-3934
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Dustin Cote
>         Attachments: YARN-3934-1.patch
>
>
> Use the following steps to test.
> 1. Set up ZK as the RM HA store.
> 2. Submit a job that refers to lots of distributed cache files with long HDFS paths, which will cause the app state size to exceed ZK's max object size limit.
> 3. RM can't write to ZK and exits with the following exception.
> {noformat}
> 2015-07-10 22:21:13,002 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:944)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:941)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1083)
> {noformat}
> In this case, RM could have rejected the app during submitApplication RPC if the size of ApplicationSubmissionContext is too large.
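
The rejection suggested in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: the class, method, and exception names are hypothetical and are not YARN APIs, and the limit is passed in as a plain parameter. The check uses >= to be conservative, since the actual ZK request carries some overhead beyond the serialized state itself.

```java
public class SubmissionSizeGuardSketch {
    /** Hypothetical exception type; a real RM would use its own exception classes. */
    static class ContextTooLargeException extends Exception {
        ContextTooLargeException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical check run before the app state is handed to the state store.
     * serializedContext stands in for the serialized ApplicationSubmissionContext.
     */
    static void validate(byte[] serializedContext, int maxBytes)
            throws ContextTooLargeException {
        if (serializedContext.length >= maxBytes) {
            throw new ContextTooLargeException(
                "Submission context is " + serializedContext.length
                + " bytes, which does not fit under the assumed limit of "
                + maxBytes + " bytes");
        }
    }

    public static void main(String[] args) {
        int limit = 0xfffff; // ZooKeeper's default jute.maxbuffer, assumed here
        try {
            validate(new byte[2 * 1024 * 1024], limit);
            System.out.println("accepted");
        } catch (ContextTooLargeException e) {
            // The client gets an error on submission instead of the RM exiting later.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing the single submission this way keeps the oversized write from ever reaching ZK, so other applications are not blocked behind connection retries.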



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)