Posted to yarn-issues@hadoop.apache.org by "Zhijie Shen (JIRA)" <ji...@apache.org> on 2014/08/18 22:45:21 UTC

[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart

    [ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101250#comment-14101250 ] 

Zhijie Shen commented on YARN-2249:
-----------------------------------

1. Should the following be done in AbstractYarnScheduler.serviceInit instead?
{code}
+    super.nmExpireInterval =
+        conf.getInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
+          YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS);
{code}
{code}
+    createReleaseCache();
{code}

2. Should RM_NM_EXPIRY_INTERVAL_MS be added to yarn-default.xml?

3. I'm not sure this is going to be an efficient data structure. Different apps' containers should not affect each other, right? A single "mutex" on the whole collection seems too coarse-grained (it blocks the allocate call). Should we use Map<AppAttemptId, List<ContainerId>> and give each app its own mutex?
{code}
+  private Set<ContainerId> pendingRelease = null;
+  private final Object mutex = new Object();
{code}
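
To make the per-app suggestion concrete, here is a minimal sketch. It is not taken from any of the attached patches. The class name PendingReleaseTracker and its methods are hypothetical, and plain String IDs stand in for ApplicationAttemptId and ContainerId so the snippet compiles without YARN on the classpath. It also uses a Set per attempt rather than a List, and relies on ConcurrentHashMap's per-key atomic compute instead of an explicit mutex object per app.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical sketch: release requests are cached per application attempt,
 * so one attempt's updates do not take a lock shared with every other attempt.
 * String IDs are placeholders for ApplicationAttemptId / ContainerId.
 */
class PendingReleaseTracker {
  private final ConcurrentMap<String, Set<String>> pendingRelease =
      new ConcurrentHashMap<>();

  /** Remember a release request for a container that is not yet recovered. */
  void addPendingRelease(String appAttemptId, String containerId) {
    // compute() runs atomically for this key, so no global mutex is needed.
    pendingRelease.compute(appAttemptId, (id, containers) -> {
      if (containers == null) {
        containers = ConcurrentHashMap.newKeySet();
      }
      containers.add(containerId);
      return containers;
    });
  }

  /**
   * Called when a container is recovered. Returns true if a release request
   * was pending for it, in which case the caller should release the container.
   */
  boolean removeIfPending(String appAttemptId, String containerId) {
    boolean[] removed = {false};
    pendingRelease.computeIfPresent(appAttemptId, (id, containers) -> {
      removed[0] = containers.remove(containerId);
      // Returning null drops the map entry once the attempt has nothing pending.
      return containers.isEmpty() ? null : containers;
    });
    return removed[0];
  }

  /** Number of release requests still pending for one attempt. */
  int pendingCount(String appAttemptId) {
    Set<String> containers = pendingRelease.get(appAttemptId);
    return containers == null ? 0 : containers.size();
  }
}
```

ConcurrentHashMap may still briefly serialize keys that hash to the same bin, so this is finer-grained than one mutex over the whole collection but not strictly one lock per app. Expiring stale entries after nmExpireInterval is left out of the sketch.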

> AM release request may be lost on RM restart
> --------------------------------------------
>
>                 Key: YARN-2249
>                 URL: https://issues.apache.org/jira/browse/YARN-2249
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch
>
>
> AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to the RM to recover the containers. If the RM receives the container release request before the container is actually recovered in the scheduler, the container won't be released and the release request will be lost.



--
This message was sent by Atlassian JIRA
(v6.2#6252)