Posted to yarn-issues@hadoop.apache.org by "Shane Kumpf (JIRA)" <ji...@apache.org> on 2018/03/06 15:35:00 UTC

[jira] [Comment Edited] (YARN-7999) Docker launch fails when user private filecache directory is missing

    [ https://issues.apache.org/jira/browse/YARN-7999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387937#comment-16387937 ] 

Shane Kumpf edited comment on YARN-7999 at 3/6/18 3:34 PM:
-----------------------------------------------------------

Thanks for the patch [~jlowe]! I haven't been able to reproduce this issue yet, so I think [~eyang] will need to validate whether this fixes the issue for him.

Regarding a different approach, my first thought was to check for the existence of the source directory in {{DockerLinuxContainerRuntime}} prior to adding the bind mount to the docker run command. If the directory doesn't exist, does it even make sense to request that bind mount?
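
As a rough illustration of that first idea only (this is not taken from the patch or from {{DockerLinuxContainerRuntime}}; the class {{MountFilter}} and method {{existingMounts}} are hypothetical names), a pre-check could drop any requested bind mount whose source directory is absent on the host:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: filters requested bind mounts.
final class MountFilter {
  private MountFilter() {
  }

  /**
   * Returns only the mounts whose host source is an existing directory.
   * Keys are host source paths, values are container destination paths.
   * Insertion order of the requested mounts is preserved.
   */
  static Map<String, String> existingMounts(Map<String, String> requested) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> mount : requested.entrySet()) {
      Path source = Paths.get(mount.getKey());
      if (Files.isDirectory(source)) {
        result.put(mount.getKey(), mount.getValue());
      }
    }
    return result;
  }
}
```

A check like this only covers mounts requested by the runtime; it would not help with paths that are validated elsewhere.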

However, it seems normalize_mounts is also called on each of the paths for the ro-mounts and rw-mounts in {{container-executor.cfg}}, so if usercache/_user_/filecache is listed as a read-only or read-write mount, we'd still run into the error. The current documentation recommends configuring the ro-mounts and rw-mounts in {{container-executor.cfg}} to be the {{nm-local-dir}} root rather than the individual directories under it. Configuring {{container-executor.cfg}} in this manner would avoid this issue. IMO, attempting to enumerate every user's usercache/_user_/filecache directory in {{container-executor.cfg}} sounds like an administrative nightmare, but I guess at the same time we should allow that scenario if someone wants to do it.
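
To show why a missing directory trips this kind of check, here is a Java analogue (normalize_mounts itself lives in the native container-executor and is not reproduced here; {{RealPathCheck}} and {{resolve}} are hypothetical names). Resolving the real path of a nonexistent directory fails, which matches the "Could not determine real path of mount" message in the issue description:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Hadoop: resolves a mount to its real path.
final class RealPathCheck {
  private RealPathCheck() {
  }

  /**
   * Returns the real (symlink-resolved) path of the mount, or an empty
   * Optional when the path cannot be resolved, for example because the
   * directory does not exist.
   */
  static Optional<Path> resolve(String mount) {
    try {
      return Optional.of(Paths.get(mount).toRealPath());
    } catch (IOException e) {
      // toRealPath throws when the file or directory is missing.
      return Optional.empty();
    }
  }
}
```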

Given this, I think your approach to ensuring the directory exists prior to launch is the way to go to address this edge case. 
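
A minimal sketch of the "ensure the directory exists before launch" idea, written in Java purely for illustration (the patch itself is not reproduced here; {{FilecacheDirs}} and {{ensureUserFilecacheDirs}} are hypothetical names, and ownership and permission handling are deliberately left out):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: pre-creates per-user filecache dirs.
final class FilecacheDirs {
  private FilecacheDirs() {
  }

  /**
   * Ensures <root>/usercache/<user>/filecache exists under every local dir
   * root and returns the resulting paths in the same order as the roots.
   * Files.createDirectories does not fail if the directory already exists,
   * so calling this repeatedly is safe.
   */
  static List<Path> ensureUserFilecacheDirs(List<String> localDirRoots, String user)
      throws IOException {
    List<Path> dirs = new ArrayList<>();
    for (String root : localDirRoots) {
      Path dir = Paths.get(root, "usercache", user, "filecache");
      Files.createDirectories(dir);
      dirs.add(dir);
    }
    return dirs;
  }
}
```

With the directories created up front, a later real-path check on the mount source has something to resolve.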

Regarding review, the cc warning seems valid.


was (Author: shanekumpf@gmail.com):
Thanks for the patch [~jlowe]! I haven't been able to reproduce this issue yet, so I think [~eyang] will need to validate whether this fixes the issue for him.

Regarding a different approach, my first thought was to check for the existence of the source directory in {{DockerLinuxContainerRuntime}} prior to adding the bind mount to the docker run command. If the directory doesn't exist, does it even make sense to request that bind mount?

However, it seems normalize_mounts is also called on each of the paths for the ro-mounts and rw-mounts in {{container-executor.cfg}}, so if usercache/_user_/filecache is listed as a read-only or read-write mount, we'd still run into the error. The current documentation recommends configuring the ro-mounts and rw-mounts in {{container-executor.cfg}} to be the {{nm-local-dir}} root rather than the individual directories under it. Configuring {{container-executor.cfg}} in this manner would avoid this issue. IMO, attempting to enumerate every user's usercache/_user_/filecache directory in {{container-executor.cfg}} sounds like an administrative nightmare, but I guess at the same time we should allow that scenario if someone wants to do it.

Given this, I think your approach to ensuring the directory exists prior to launch is the way to go to address this edge case.

Regarding review, the cc warning seems valid, and there is a missing free for the filecache_dir in the case where all of the filecache_dir values can be determined and the mkdirs succeed.

> Docker launch fails when user private filecache directory is missing
> --------------------------------------------------------------------
>
>                 Key: YARN-7999
>                 URL: https://issues.apache.org/jira/browse/YARN-7999
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Assignee: Jason Lowe
>            Priority: Major
>         Attachments: YARN-7999.001.patch
>
>
> Docker container is failing to launch in trunk.  The root cause is:
> {code}
> [COMPINSTANCE sleeper-1 : container_1520032931921_0001_01_000020]: [2018-03-02 23:26:09.196]Exception from container-launch.
> Container id: container_1520032931921_0001_01_000020
> Exit code: 29
> Exception message: image: hadoop/centos:latest is trusted in hadoop registry.
> Could not determine real path of mount '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache'
> Could not determine real path of mount '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache'
> Invalid docker mount '/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache:/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache', realpath=/tmp/hadoop-yarn/nm-local-dir/usercache/hbase/filecache
> Error constructing docker command, docker error code=12, error message='Invalid docker mount'
> Shell output: main : command provided 4
> main : run as user is hbase
> main : requested yarn user is hbase
> Creating script paths...
> Creating local dirs...
> [2018-03-02 23:26:09.240]Diagnostic message from attempt 0 : [2018-03-02 23:26:09.240]
> [2018-03-02 23:26:09.240]Container exited with a non-zero exit code 29.
> [2018-03-02 23:26:39.278]Could not find nmPrivate/application_1520032931921_0001/container_1520032931921_0001_01_000020//container_1520032931921_0001_01_000020.pid in any of the directories
> [COMPONENT sleeper]: Failed 11 times, exceeded the limit - 10. Shutting down now...
> {code}
> The filecache cannot be mounted because it doesn't exist.


