Posted to dev@submarine.apache.org by "Zhankun Tang (Jira)" <ji...@apache.org> on 2020/03/26 00:34:00 UTC

[jira] [Commented] (SUBMARINE-457) Run TF MNIST example using Docker Container failed in mini-submarine

    [ https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067256#comment-17067256 ] 

Zhankun Tang commented on SUBMARINE-457:
----------------------------------------

[~lowc1012], thanks for reporting this. [~oliverhuhuhu@gmail.com], it seems the TonY AM cannot log in. Any thoughts?
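For reference: the "LoginException: java.lang.NullPointerException: invalid null input: name" in the AM stdout usually means that the UID the NodeManager launches the Docker container with has no /etc/passwd entry inside the image, so UnixLoginModule cannot resolve a user name and UserGroupInformation cannot create a login user. A quick way to check (just a sketch, reusing the image name from the report and assuming the job runs as the yarn user on the host):

{code}
# Run the image with the same UID/GID that YARN would use and see whether a user name resolves.
docker run --rm -u "$(id -u yarn):$(id -g yarn)" tf-1.13.1-cpu-base:0.0.1 id
# If the output shows only numeric uid/gid values and no user name, the AM login will fail as above.
{code}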

 
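If that is indeed the cause, one possible workaround (a hypothetical sketch, not a verified fix) is to rebuild the image with a user whose UID/GID match the host's yarn user, and point --docker_image at the new tag:

{code}
# Sketch: add a "yarn" user to the image so the container UID resolves to a name.
# Assumes a Debian/Ubuntu-based image that ships groupadd/useradd; the 0.0.2 tag is a placeholder.
# $(id -u yarn) and $(id -g yarn) are expanded on the host, so the image user matches the NodeManager's yarn user.
docker build -t tf-1.13.1-cpu-base:0.0.2 - <<EOF
FROM tf-1.13.1-cpu-base:0.0.1
RUN groupadd -g $(id -g yarn) yarn && useradd -m -u $(id -u yarn) -g $(id -g yarn) yarn
EOF
{code}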

> Run TF MNIST example using Docker Container failed in mini-submarine 
> ---------------------------------------------------------------------
>
>                 Key: SUBMARINE-457
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-457
>             Project: Apache Submarine
>          Issue Type: Bug
>          Components: Mini Submarine
>    Affects Versions: 0.4.0
>            Reporter: Ryan Lo
>            Priority: Major
>
> I tried to run mnist_distributed.py using a Docker container, and the launch failed.
> The following is my command; the Docker image tf-1.13.1-cpu-base:0.0.1 was built in advance in mini-submarine.
> {code:java}
> java -cp $(hadoop classpath --glob):/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
>  --framework tensorflow \
>  --docker_image tf-1.13.1-cpu-base:0.0.1 \
>  --input_path "" \
>  --num_ps 1 \
>  --ps_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
>  --ps_resources memory=1G,vcores=1 \
>  --num_workers 2 \
>  --worker_resources memory=1G,vcores=1 \
>  --worker_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
>  --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>  --env HADOOP_HOME=/hadoop-current \
>  --env HADOOP_YARN_HOME=/hadoop-current \
>  --env HADOOP_COMMON_HOME=/hadoop-current \
>  --env HADOOP_HDFS_HOME=/hadoop-current \
>  --env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop \
>  --conf tony.containers.resources=/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar,/home/yarn/submarine/mnist_distributed.py
> {code}
> The following is a partial NodeManager log.
> {code:java}
> 2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1585136148243_0006_01_000001 transitioned from SCHEDULED to RUNNING
> 2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1585136148243_0006_01_000001
> 2020-03-25 13:48:32,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: setting hostname in container to: ctr-1585136148243-0006-01-000001
> 2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: Docker inspect output for container_1585136148243_0006_01_000001: ,ctr-1585136148243-0006-01-000001
> 2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1585136148243_0006_01_000001's ip = , and hostname = ctr-1585136148243-0006-01-000001
> 2020-03-25 13:48:34,613 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1585136148243_0006_01_000001 since CPU usage is not yet available.
> 2020-03-25 13:48:36,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Stderr:
> Docker container exit code was not zero: 255
> Unable to read from docker logs(ferror, feof): 0 1
> Stdout: main : command provided 4
> main : run as user is yarn
> main : requested yarn user is yarn
> Creating script paths...
> Creating local dirs...
> Getting exit code file...
> Changing effective user to root...
> Launching docker container...
> Inspecting docker container...
> Writing to cgroup task files...
> Writing pid file...
> Writing to tmp file /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/nmPrivate/application_1585136148243_0006/container_1585136148243_0006_01_000001/container_1585136148243_0006_01_000001.pid.tmp
> container_1585136148243_0006_01_000001
> Waiting for docker container to finish...
> Removing docker container post-exit...
> {code}
> The following is the AM stdout log (amstdout.log).
> {code:java}
> ========================================================================
> LogType:amstdout.log
> LogLastModifiedTime:Wed Mar 25 13:02:27 +0000 2020
> LogLength:6468
> LogContents:
> [WARN ] 2020-03-25 13:02:25,503 method:org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:60)
> Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> [ERROR] 2020-03-25 13:02:25,613 method:com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:217)
> Failed to create FileSystem object
> org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
>  at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
>  at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
>  at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
>  at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
>  at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
>  at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
>  at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
>  at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
>  at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
>  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
>  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
>  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
>  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
>  at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
>  at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
>  at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
>  at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
>  at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
>  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
>  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
>  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
>  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
>  at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
>  at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
>  at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
> Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
>  at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
>  at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
>  at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
>  at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
>  at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
>  at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
>  at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
>  at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
>  at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
>  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
>  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
>  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
>  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
>  at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
>  at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
>  at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
>  at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
>  at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
>  at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
>  at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
>  at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
>  at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
>  at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
>  ... 11 more
> [INFO ] 2020-03-25 13:02:25,618 method:com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:298)
> Application Master failed. Exiting
> End of LogType:amstdout.log
> *****************************************************************************
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
