Posted to yarn-issues@hadoop.apache.org by "Varun Vasudev (JIRA)" <ji...@apache.org> on 2016/07/12 14:59:21 UTC
[jira] [Commented] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
[ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373009#comment-15373009 ]
Varun Vasudev commented on YARN-4876:
-------------------------------------
Thanks for all the work [~asuresh] and [~marco.rabozzi]. My apologies for the delay in reviewing the patch. Some comments -
1) What do you think about renaming C_AWAITING_START to READY and ContainerState.AWAITING_START to ContainerState.READY?
2) Rename multistart/restart to something else. I think multiStart is awkward. Maybe something like extendedLifecycle?
3) We should retain the two-param StartedContainer constructor; otherwise we end up touching all the StartedContainer creation sites.
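For example, the retained two-param constructor could just delegate to the new one (a sketch - the name of the third parameter, initOnly here, is my assumption, inferred from the boolean in the snippet in 4 below):
{code}
// retained two-param constructor; existing call sites stay untouched
public StartedContainer(ContainerId containerId, NodeId nodeId) {
  this(containerId, nodeId, false);
}

public StartedContainer(ContainerId containerId, NodeId nodeId,
    boolean initOnly) {
  this.containerId = containerId;
  this.nodeId = nodeId;
  this.initOnly = initOnly;
}
{code}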
4)
{code}
+ @Override
+ public Map<String, ByteBuffer> initializeContainer(Container container,
+ ContainerLaunchContext containerLaunchContext)
+ throws YarnException, IOException {
+
+ StartedContainer startingContainer =
+ new StartedContainer(container.getId(), container.getNodeId(), true);
{code}
Rename startingContainer to initializingContainer
5)
{code}
+ @Override
+ public Map<String, ByteBuffer> initializeContainer(Container container,
+ ContainerLaunchContext containerLaunchContext)
+ throws YarnException, IOException {
+ if (!startingContainer.allowsRestart()) {
+ throw RPCUtil.getRemoteException(
+ "Container " + startingContainer.containerId.toString()
+ + " already initialized and does not allow reinitialization");
+ }
{code}
In this context, what does re-initialization mean? Should an app be allowed to call initialize twice?
6)
{code}
+ public void destroyContainer(ContainerId containerId, NodeId nodeId)
+ throws YarnException, IOException {
+ if (startedContainer != null) {
+ synchronized (startedContainer) {
{code}
This looks like it might be a check-then-act race - startedContainer can change between the null check and entering the synchronized block. Should the null check happen under the same lock?
7)
{code}
+ destroyContainerInternal(containerId, nodeId);
{code}
I think the destroyContainerInternal call can be moved out of the synchronized block?
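To make 6) and 7) concrete, something like this (a sketch - the startedContainers map and its locking are my assumptions about the surrounding code):
{code}
public void destroyContainer(ContainerId containerId, NodeId nodeId)
    throws YarnException, IOException {
  StartedContainer startedContainer;
  // do the check-then-act atomically under one lock, so the entry
  // cannot appear or vanish between the null check and its use
  synchronized (startedContainers) {
    startedContainer = startedContainers.remove(containerId);
  }
  if (startedContainer != null) {
    // the potentially slow destroy runs outside any lock
    destroyContainerInternal(containerId, nodeId);
  }
}
{code}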
8)
{code}
+ if(launchContext != null || !isRestart) {
+ Map<String, ByteBuffer> serviceData = getAuxServiceMetaData();
+ if (launchContext.getServiceData()!=null &&
+ !launchContext.getServiceData().isEmpty()) {
+ for (Entry<String, ByteBuffer> meta : launchContext.getServiceData()
+ .entrySet()) {
+ if (null == serviceData.get(meta.getKey())) {
+ throw new InvalidAuxServiceException("The auxService:" + meta.getKey()
+ + " does not exist");
+ }
{code}
There's a potential NPE here - launchContext can be null while isRestart is false, so the || condition still lets execution reach launchContext.getServiceData() and throw an NPE.
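If the intent was to validate aux-service data only when a launch context is actually present, flipping the || to && would fix it (a sketch based on the snippet above):
{code}
// only enter the block when there is a launch context to inspect
if (launchContext != null && !isRestart) {
  Map<String, ByteBuffer> serviceData = getAuxServiceMetaData();
  if (launchContext.getServiceData() != null
      && !launchContext.getServiceData().isEmpty()) {
    for (Entry<String, ByteBuffer> meta
        : launchContext.getServiceData().entrySet()) {
      if (serviceData.get(meta.getKey()) == null) {
        throw new InvalidAuxServiceException("The auxService:"
            + meta.getKey() + " does not exist");
      }
    }
  }
}
{code}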
9)
{code}
+ private long verifyAndGetDestroyDelay(
+ InitializeContainerRequest request,
+ ContainerTokenIdentifier containerTokenIdentifier)
+ throws YarnException {
+
+ ContainerId containerId = containerTokenIdentifier.getContainerID();
+ String containerIdStr = containerId.toString();
+ String user = containerTokenIdentifier.getApplicationSubmitter();
+
+ ApplicationId applicationID =
+ containerId.getApplicationAttemptId().getApplicationId();
+
+ long destroyDelay = request.getDestroyDelay();
+
+ if (destroyDelay < 0) {
+ NMAuditLogger.logFailure(user, AuditConstants.INITIALIZE_CONTAINER,
+ "ContainerManagerImpl", "Invalid destroy delay value", applicationID,
+ containerId);
+ throw RPCUtil.getRemoteException(
+ "Container " + containerIdStr + " invalid destroy delay value");
+ } else if (destroyDelay == 0) {
+ destroyDelay = this.getConfig().getLong(
+ YarnConfiguration.NM_CONTAINER_DESTROY_DELAY_MS,
+ YarnConfiguration.DEFAULT_NM_CONTAINER_DESTROY_DELAY_MS);
+ }
+
+ return destroyDelay;
+ }
{code}
I think we need to add a check that the destroyDelay isn't longer than the max destroy delay.
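For instance (the max-delay config key below is hypothetical - it doesn't exist in the patch and would need to be added):
{code}
// hypothetical config key - would need to be added to YarnConfiguration
long maxDestroyDelay = this.getConfig().getLong(
    YarnConfiguration.NM_CONTAINER_DESTROY_DELAY_MAX_MS,
    YarnConfiguration.DEFAULT_NM_CONTAINER_DESTROY_DELAY_MAX_MS);
if (destroyDelay > maxDestroyDelay) {
  NMAuditLogger.logFailure(user, AuditConstants.INITIALIZE_CONTAINER,
      "ContainerManagerImpl", "Destroy delay exceeds maximum",
      applicationID, containerId);
  throw RPCUtil.getRemoteException("Container " + containerIdStr
      + " destroy delay " + destroyDelay + " exceeds maximum "
      + maxDestroyDelay);
}
{code}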
10)
{code}
+
+ LOG.info("Initialize request for " + containerIdStr + " by user " + user);
+ LOG.info("Reinitializing container " + containerIdStr);
{code}
and
{code}
+ LOG.info("Initialize request for " + containerIdStr + " by user " + user);
+ LOG.info("Initializing new container " + containerIdStr);
{code}
In both cases, we don't need two logging statements - we should merge them into one.
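e.g. for the first pair, a single statement carries the same information:
{code}
LOG.info("Reinitialize request for container " + containerIdStr
    + " by user " + user);
{code}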
11)
{code}
+ LOG.info("Start request for " + containerIdStr + " by user " + user);
{code}
In restartContainerInternal, shouldn't the message say restart?
12)
{code}
+ metrics.launchedContainer();
+ metrics.allocateContainer(containerTokenIdentifier.getResource());
{code}
I suspect this will lead to double counting - if a container is restarted multiple times, it'll be recorded as multiple launches and allocations.
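A simple guard would avoid that, assuming the isRestart flag used elsewhere in the patch is in scope here:
{code}
// count launch/allocation metrics only on the first start, not restarts
if (!isRestart) {
  metrics.launchedContainer();
  metrics.allocateContainer(containerTokenIdentifier.getResource());
}
{code}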
13)
{code}
+ if(container.allowsRestart()) {
+ stopContainerInternal(id);
+ } else {
+ destroyContainerInternal(id);
+ }
{code}
I'm wondering if we should change this so that the code just calls stopContainerInternal, which in turn calls destroy if necessary. What do you think?
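Roughly like this (a sketch - I'm assuming stopContainerInternal can resolve the container, and that destroyContainerInternal takes the id as in the snippet above):
{code}
// caller side - no branch needed:
stopContainerInternal(id);

// inside stopContainerInternal:
private void stopContainerInternal(ContainerId id)
    throws YarnException, IOException {
  // ... existing stop logic ...
  if (!container.allowsRestart()) {
    // non-restartable containers are destroyed as part of the stop
    destroyContainerInternal(id);
  }
}
{code}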
14)
{code}
+ ApplicationEventType.START_CONTAINER,
+ new StartContainerTransition())
+ .addTransition(ApplicationState.RUNNING,
+ ApplicationState.RUNNING,
ApplicationEventType.APPLICATION_CONTAINER_FINISHED,
{code}
The second transition looks wrong.
15)
{code}
+ @Override
+ public boolean allowsRestart() {
+ return destroyDelay > 0;
+ }
{code}
I'm not sure this is correct. Are we using destroyDelay as a proxy for whether restart is allowed?
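If the two concepts are meant to be independent, an explicit flag would make the intent clearer (restartable is a hypothetical field name, not in the patch):
{code}
// decouple "may this container restart" from "how long before destroy"
private final boolean restartable;

@Override
public boolean allowsRestart() {
  return restartable;
}
{code}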
16)
{code}
+ // for container that allows restart we should
+ // avoid trying to recover again
+ if(allowsRestart()) {
+ recoveredStatus = RecoveredContainerStatus.REQUESTED;
+ }
{code}
Can you explain why we shouldn't try to recover launched containers?
17)
{code}
+ if(recordedWorkDir != null) {
+ containerWorkDir = new Path(recordedWorkDir);
+ // remove content of working directory for a clean start
+ // since data from a previous run might still be present
+ LOG.info("Removing working directory: " + recordedWorkDir
+ + " for container: " + containerIdStr);
+ lfs.delete(containerWorkDir, true);
+
+ // TODO: let the application master specify a policy for data removal
{code}
I think this has to change significantly - we can't delete everything in the container work directory. We should clean up only the resources that were localized for the container.
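Something along these lines, perhaps - delete only the entries that were localized for the container (a sketch; it assumes the localized resources appear in the work dir under their link names from the launch context):
{code}
if (recordedWorkDir != null) {
  containerWorkDir = new Path(recordedWorkDir);
  // remove only the localized resource links, preserving any data the
  // previous run wrote into the work directory
  for (String linkName :
      container.getLaunchContext().getLocalResources().keySet()) {
    Path localizedEntry = new Path(containerWorkDir, linkName);
    LOG.info("Removing localized resource: " + localizedEntry
        + " for container: " + containerIdStr);
    lfs.delete(localizedEntry, true);
  }
}
{code}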
18)
{code}
- public void cleanupContainer() throws IOException {
+ public void cleanupContainer(boolean storeAsCompleted) throws IOException {
ContainerId containerId = container.getContainerId();
String containerIdStr = ConverterUtils.toString(containerId);
LOG.info("Cleaning up container " + containerIdStr);
- try {
- context.getNMStateStore().storeContainerKilled(containerId);
- } catch (IOException e) {
- LOG.error("Unable to mark container " + containerId
- + " killed in store", e);
+ if(storeAsCompleted) {
+ try {
+ context.getNMStateStore().storeContainerKilled(containerId);
+ } catch (IOException e) {
+ LOG.error("Unable to mark container " + containerId
+ + " killed in store", e);
+ }
}
{code}
Any reason for changing the cleanupContainer function? If storeAsCompleted is false, we could just not call cleanupContainer and avoid modifying its implementation, no?
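i.e. leave cleanupContainer untouched and guard the call at the call site instead (a sketch - the caller-side variable name launch is assumed):
{code}
// at the call site: skip cleanup entirely when nothing should be
// stored, leaving cleanupContainer's implementation unchanged
if (storeAsCompleted) {
  launch.cleanupContainer();
}
{code}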
> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
> Key: YARN-4876
> URL: https://issues.apache.org/jira/browse/YARN-4876
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Arun Suresh
> Assignee: Marco Rabozzi
> Attachments: YARN-4876-design-doc.pdf, YARN-4876.002.patch, YARN-4876.01.patch
>
>
> Introduce *initialize* and *destroy* container API into the *ContainerManagementProtocol* and decouple the actual start of a container from the initialization. This will allow AMs to re-start a container without having to lose the allocation.
> Additionally, if the localization of the container is associated with the initialize (and the cleanup with the destroy), this can also be used by applications to upgrade a Container by *re-initializing* it with a new *ContainerLaunchContext*.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org