Posted to yarn-issues@hadoop.apache.org by "Eric Yang (JIRA)" <ji...@apache.org> on 2018/07/26 18:31:00 UTC

[jira] [Comment Edited] (YARN-8587) Delays are noticed to launch docker container

    [ https://issues.apache.org/jira/browse/YARN-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558720#comment-16558720 ] 

Eric Yang edited comment on YARN-8587 at 7/26/18 6:30 PM:
----------------------------------------------------------

This bug occurs because a detached docker run reports exit code 0 even though the process inside the container fails to run.  For a brief period, the node manager reports the container as RUNNING, then fails it later.  One possible solution is to change container-executor so that non-entry-point mode behaves more like entry-point mode: run docker run in the foreground, and have the parent process retry docker inspect a set number of times to obtain the PID.  This removes the possible false-positive reporting of the RUNNING state.  The synthetic timeout approach may kill the container prematurely if it takes more than 30 seconds (or the configured value) to start its first process, or wait longer than necessary for a container that is failing.  Do we want to make non-entry-point mode work like entry-point mode to prevent the false positive, or are we OK with the current state?
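
To make the retry idea concrete, here is a rough sketch in Python. It is only an illustration, not the container-executor code (which is written in C). The function names, the retry count, and the delay are assumptions picked for the example; the only real command used is docker inspect with the {{.State.Pid}} format, which prints 0 while no process is running in the container.

```python
import subprocess
import time
from typing import Callable, Optional


def docker_inspect_pid(container: str) -> int:
    """Ask docker for the PID of the container's first process.

    Returns 0 when docker does not know the container yet or when no
    process is running in it. (Hypothetical helper for this sketch.)
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Pid}}", container],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return 0
    try:
        return int(result.stdout.strip())
    except ValueError:
        return 0


def wait_for_pid(
    container: str,
    retries: int = 10,          # assumed value, not a YARN default
    delay: float = 1.0,         # assumed seconds between attempts
    inspect: Callable[[str], int] = docker_inspect_pid,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Poll for the container PID, giving up after `retries` attempts.

    Returns the PID once it is positive, or None if it never shows up,
    so the caller reports RUNNING only after a real process exists.
    """
    for attempt in range(retries):
        pid = inspect(container)
        if pid > 0:
            return pid
        if attempt < retries - 1:
            sleep(delay)
    return None
```

With this shape, the parent would report RUNNING only when wait_for_pid returns a PID and would fail the container when it returns None, instead of trusting the exit code of a detached docker run.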


> Delays are noticed to launch docker container
> ---------------------------------------------
>
>                 Key: YARN-8587
>                 URL: https://issues.apache.org/jira/browse/YARN-8587
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Yesha Vora
>            Priority: Major
>
> Launch the dshell application. Wait for the application to reach the RUNNING state.
> {code:java}
> yarn  jar /xx/hadoop-yarn-applications-distributedshell-*.jar  -shell_command "sleep 300" -num_containers 1 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=httpd:0.1 -shell_env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell-xx.jar
> {code}
> Find the container allocation. Run the docker inspect command for the docker containers launched by the app.
> Sometimes the container is allocated to the NM, but the docker PID is not up.
> {code:java}
> Command ssh -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null xxx "sudo su - -c \"docker ps  -a | grep container_e02_1531189225093_0003_01_000002\" root" failed after 0 retries 
> {code}


