Posted to yarn-issues@hadoop.apache.org by "Chandni Singh (JIRA)" <ji...@apache.org> on 2018/05/03 17:33:00 UTC
[jira] [Resolved] (YARN-8231) Dshell application fails when one of the docker container gets killed

     [ https://issues.apache.org/jira/browse/YARN-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chandni Singh resolved YARN-8231.
---------------------------------
    Resolution: Invalid

# The distributed shell application does not re-launch a container when it receives a container-completed event from the NodeManager.
# To have the NM retry failed containers, additional configuration must be provided, for example {{container_retry_policy}} and {{container_max_retries}}.
# Force-killing a container, which produces exit code 137, does not trigger a retry:
{code}
  @Override
  public boolean shouldRetry(int errorCode) {
    if (errorCode == ExitCode.SUCCESS.getExitCode()
        || errorCode == ExitCode.FORCE_KILLED.getExitCode()
        || errorCode == ExitCode.TERMINATED.getExitCode()) {
      return false;
    }
    return retryPolicy.shouldRetry(windowRetryContext, errorCode);
  }
{code}
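To illustrate the check above outside of YARN, here is a minimal, self-contained sketch. The class name RetryCheckSketch, the IntPredicate standing in for the configured retry policy, and the literal exit-code values (0, 137, 143) are assumptions made for this example; they are meant to mirror SUCCESS, FORCE_KILLED and TERMINATED, not to reproduce the actual NodeManager classes.

```java
import java.util.function.IntPredicate;

// Hypothetical stand-alone sketch; not the NodeManager implementation.
final class RetryCheckSketch {
    // Assumed exit-code values: 137 = 128 + SIGKILL(9), 143 = 128 + SIGTERM(15).
    static final int SUCCESS = 0;
    static final int FORCE_KILLED = 137;
    static final int TERMINATED = 143;

    // Returns false for success and for deliberate kills, regardless of policy;
    // otherwise defers to the supplied policy (a stand-in for retryPolicy.shouldRetry).
    static boolean shouldRetry(int exitCode, IntPredicate retryPolicy) {
        if (exitCode == SUCCESS
                || exitCode == FORCE_KILLED
                || exitCode == TERMINATED) {
            return false;
        }
        return retryPolicy.test(exitCode);
    }

    public static void main(String[] args) {
        // A policy that would retry on every error code.
        IntPredicate retryOnAllErrors = code -> true;

        // Killed container (exit code 137): never retried, even with this policy.
        System.out.println(shouldRetry(FORCE_KILLED, retryOnAllErrors));
        // Ordinary failure (exit code 1): the policy decides.
        System.out.println(shouldRetry(1, retryOnAllErrors));
    }
}
```

Under these assumptions, even a retry-on-all-errors policy leaves a killed container (exit code 137) un-retried, which matches the behavior reported in this issue.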

> Dshell application fails when one of the docker container gets killed
> ---------------------------------------------------------------------
>
>                 Key: YARN-8231
>                 URL: https://issues.apache.org/jira/browse/YARN-8231
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>            Reporter: Yesha Vora
>            Priority: Critical
>
> 1) Launch dshell application
> {code}
> yarn jar hadoop-yarn-applications-distributedshell-*.jar -shell_command "sleep 300" -num_containers 2 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos/httpd-24-centos7:latest -keep_containers_across_application_attempts -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell-*.jar
> {code}
> 2) Kill container_1524681858728_0012_01_000002
> Expected behavior:
> The application should start a new container instance and finish successfully.
> Actual behavior:
> The application failed as soon as the container was killed.
> {code:title=AM log}
> 18/04/27 23:05:12 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, completedCnt=1
> 18/04/27 23:05:12 INFO distributedshell.ApplicationMaster: appattempt_1524681858728_0012_000001 got container status for containerID=container_1524681858728_0012_01_000002, state=COMPLETE, exitStatus=137, diagnostics=[2018-04-27 23:05:09.310]Container killed on request. Exit code is 137
> [2018-04-27 23:05:09.331]Container exited with a non-zero exit code 137. 
> [2018-04-27 23:05:09.332]Killed by external signal
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Got response from RM for container ask, completedCnt=1
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: appattempt_1524681858728_0012_000001 got container status for containerID=container_1524681858728_0012_01_000003, state=COMPLETE, exitStatus=0, diagnostics=
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Container completed successfully., containerId=container_1524681858728_0012_01_000003
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Application completed. Stopping running containers
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Application completed. Signalling finish to RM
> 18/04/27 23:08:46 INFO distributedshell.ApplicationMaster: Diagnostics., total=2, completed=2, allocated=2, failed=1
> 18/04/27 23:08:46 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
> {code}


