Posted to dev@submarine.apache.org by li...@apache.org on 2019/12/26 14:29:17 UTC

[submarine] branch master updated: SUBMARINE-309. Make yarn service configuration optional in installation guide.

This is an automated email from the ASF dual-hosted git repository.

liuxun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git


The following commit(s) were added to refs/heads/master by this push:
     new 145dd22  SUBMARINE-309. Make yarn service configuration optional in installation guide.
145dd22 is described below

commit 145dd2262695e8262b05249e0b74009570387d70
Author: Zac Zhou <zh...@apache.org>
AuthorDate: Fri Dec 20 20:10:27 2019 +0800

    SUBMARINE-309. Make yarn service configuration optional in installation guide.
    
    ### What is this PR for?
    At this time, Submarine has two runtimes for yarn clusters: yarn service and TonY, so the installation guide should work for both runtimes. As the installation of Calico and registry DNS is only needed by the yarn service runtime, those steps should be marked as optional.
    
    ### What type of PR is it?
    Improvement
    
    ### Todos
    * [ ] - Task
    
    ### What is the Jira issue?
    https://issues.apache.org/jira/browse/SUBMARINE-309
    
    ### How should this be tested?
    https://travis-ci.org/yuanzac/hadoop-submarine/builds/627688854?utm_source=github_status&utm_medium=notification
    
    ### Screenshots (if appropriate)
    
    ### Questions:
    * Do the license files need an update? No
    * Are there breaking changes for older versions? No
    * Does this need documentation? Yes
    
    Author: Zac Zhou <zh...@apache.org>
    
    Closes #131 from yuanzac/topic/SUBMARINE-309 and squashes the following commits:
    
    93e758c [Zac Zhou] SUBMARINE-309. Make yarn service configuration optional in installation guide.
---
 .../submarine-installer/InstallationGuide.md       | 266 ++++++++++++++------
 .../InstallationGuideChineseVersion.md             | 272 ++++++++++++++-------
 .../runtime/fs/MockRemoteDirectoryManager.java     |   2 +-
 .../tensorflow/TensorFlowConfigEnvGenerator.java   |  86 +++----
 .../TensorFlowConfigEnvGeneratorTest.java          | 132 ++--------
 5 files changed, 445 insertions(+), 313 deletions(-)

diff --git a/dev-support/submarine-installer/InstallationGuide.md b/dev-support/submarine-installer/InstallationGuide.md
index bfc168b..0e3a328 100644
--- a/dev-support/submarine-installer/InstallationGuide.md
+++ b/dev-support/submarine-installer/InstallationGuide.md
@@ -381,82 +381,12 @@ If there are some errors, we could check the following configuration.
    ls -l /usr/local/nvidia/lib64 | grep libcuda.so
    ```
 
-### Etcd Installation
-
-etcd is a distributed reliable key-value store for the most critical data of a distributed system, Registration and discovery of services used in containers.
-You can also choose alternatives like zookeeper, Consul.
-
-To install Etcd on specified servers, we can run Submarine-installer/install.sh
-
-```shell
-$ ./Submarine-installer/install.sh
-# Etcd status
-systemctl status Etcd.service
-```
-
-Check Etcd cluster health
-
-```shell
-$ etcdctl cluster-health
-member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
-member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
-member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
-cluster is healthy
-
-$ etcdctl member list
-3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
-85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
-b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
-```
-
-
-
-### Calico Installation
-
-Calico creates and manages a flat three-tier network, and each container is assigned a routable ip. We just add the steps here for your convenience.
-You can also choose alternatives like Flannel, OVS.
-
-To install Calico on specified servers, we can run Submarine-installer/install.sh
-
-```
-systemctl start calico-node.service
-systemctl status calico-node.service
-```
-
-#### Check Calico Network
-
-```shell
-# Run the following command to show the all host status in the cluster except localhost.
-$ calicoctl node status
-Calico process is running.
-
-IPv4 BGP status
-+---------------+-------------------+-------+------------+-------------+
-| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
-+---------------+-------------------+-------+------------+-------------+
-| ${host_ip1} | node-to-node mesh | up    | 2018-09-21 | Established |
-| ${host_ip2} | node-to-node mesh | up    | 2018-09-21 | Established |
-| ${host_ip3} | node-to-node mesh | up    | 2018-09-21 | Established |
-+---------------+-------------------+-------+------------+-------------+
-
-IPv6 BGP status
-No IPv6 peers found.
-```
-
-Create containers to validate calico network
-
-```
-docker network create --driver calico --ipam-driver calico-ipam calico-network
-docker run --net calico-network --name workload-A -tid busybox
-docker run --net calico-network --name workload-B -tid busybox
-docker exec workload-A ping workload-B
-```
-
 
 ## Hadoop Installation
 
 ### Get Hadoop Release
-You can either get Hadoop release binary or compile from source code. Please follow the https://hadoop.apache.org/ guides.
+You can either download a Hadoop release binary or compile it from source. Please follow the guides on the [Hadoop Homepage](https://hadoop.apache.org/).
+For Hadoop cluster setup, please refer to [Hadoop Cluster Setup](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html).
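+
+As a minimal sketch (the version and paths are placeholders; adjust them to the release you actually downloaded), unpack the release and point HADOOP_HOME at it:
+
+```shell
+# Unpack the Hadoop release and export HADOOP_HOME for the following steps
+tar -xzf hadoop-3.1.0.tar.gz -C /home/hadoop
+export HADOOP_HOME=/home/hadoop/hadoop-3.1.0
+cd ${HADOOP_HOME}
+```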
 
 
 ### Start yarn service
@@ -468,20 +398,12 @@ YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
 YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
 ```
 
-### Start yarn registery dns service
-
-```
-sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
-```
-
 ### Test with a MR wordcount job
 
 ```
 ./bin/hadoop jar /home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar wordcount /tmp/wordcount.txt /tmp/wordcount-output4
 ```
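+
+If /tmp/wordcount.txt does not exist yet on HDFS, a quick way to create it (paths match the command above):
+
+```shell
+# Create a tiny input file and upload it to HDFS
+echo "hello hadoop hello submarine" > /tmp/wordcount.txt
+./bin/hadoop fs -put -f /tmp/wordcount.txt /tmp/wordcount.txt
+```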
 
-
-
 ## Tensorflow Job with CPU
 
 ### Standalone Mode
@@ -599,3 +521,187 @@ Add configurations in container-executor.cfg
    root=/sys/fs/cgroup
    yarn-hierarchy=/hadoop-yarn
    ```
+
+## Yarn Service Runtime Requirement (Deprecated)
+
+YARN native services have been available since Hadoop 3.1.0, and Submarine can
+use them to submit ML jobs. However, this runtime depends on several additional
+components that are hard to enable and maintain, so the yarn service runtime
+has been deprecated since Submarine 0.3.0. We recommend using YarnRuntime
+instead. If you still want to enable it, please follow these steps.
+
+### Etcd Installation
+
+etcd is a distributed, reliable key-value store for the most critical data of a distributed system; here it provides registration and discovery for services running in containers.
+You can also choose alternatives such as ZooKeeper or Consul.
+
+To install etcd on the specified servers, run Submarine-installer/install.sh:
+
+```shell
+$ ./Submarine-installer/install.sh
+# Etcd status
+systemctl status Etcd.service
+```
+
+Check the etcd cluster health:
+
+```shell
+$ etcdctl cluster-health
+member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
+member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
+member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
+cluster is healthy
+
+$ etcdctl member list
+3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
+85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
+b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
+```
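+
+As an additional sanity check (assuming the etcd v2 API, consistent with the etcdctl commands above), write and read back a key:
+
+```shell
+etcdctl set /submarine/smoke-test ok
+etcdctl get /submarine/smoke-test
+etcdctl rm /submarine/smoke-test
+```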
+
+### Calico Installation
+
+Calico creates and manages a flat layer-3 network, and each container is assigned a routable IP. We include the steps here for your convenience.
+You can also choose alternatives such as Flannel or OVS.
+
+To install Calico on the specified servers, run Submarine-installer/install.sh, then start and check the service:
+
+```
+systemctl start calico-node.service
+systemctl status calico-node.service
+```
+
+#### Check Calico Network
+
+```shell
+# Show the status of all hosts in the cluster except localhost.
+$ calicoctl node status
+Calico process is running.
+
+IPv4 BGP status
++---------------+-------------------+-------+------------+-------------+
+| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
++---------------+-------------------+-------+------------+-------------+
+| ${host_ip1} | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip2} | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip3} | node-to-node mesh | up    | 2018-09-21 | Established |
++---------------+-------------------+-------+------------+-------------+
+
+IPv6 BGP status
+No IPv6 peers found.
+```
+
+Create containers to validate the Calico network:
+
+```
+docker network create --driver calico --ipam-driver calico-ipam calico-network
+docker run --net calico-network --name workload-A -tid busybox
+docker run --net calico-network --name workload-B -tid busybox
+docker exec workload-A ping workload-B
+```
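+
+When the ping succeeds, the test containers can be removed again (a cleanup sketch):
+
+```shell
+docker rm -f workload-A workload-B
+# Remove the network only if you will not reuse it for YARN below:
+# docker network rm calico-network
+```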
+
+### Enable the Calico network for Docker containers
+Set the following in yarn-site.xml so that Docker containers use the Calico network:
+
+```
+<property>
+  <name>yarn.nodemanager.runtime.linux.docker.default-container-network</name>
+  <value>calico-network</value>
+</property>
+<property>
+  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
+  <value>default,docker</value>
+</property>
+<property>
+  <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
+  <value>host,none,bridge,calico-network</value>
+</property>
+```
+
+Add calico-network to the allowed networks in container-executor.cfg:
+
+```
+docker.allowed.networks=bridge,host,none,calico-network
+```
+
+Then restart all NodeManagers.
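+
+For example, on each NodeManager host, using the same sbin scripts as in the YARN startup section above:
+
+```shell
+YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh stop nodemanager
+YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
+```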
+
+### Start the yarn registry dns service
+
+The yarn registry DNS server exposes existing service-discovery information via DNS
+and maps docker container names to IPs. With it, the containers of an
+ML job know how to communicate with each other.
+
+Please pick a server on which to start the yarn registry dns service. For details, please
+refer to [Registry DNS Server](http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/RegistryDNS.html)
+
+```
+sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
+```
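+
+To check that the DNS server answers queries, resolve a container hostname of a running job. The hostname below is purely illustrative (it follows the component-index/job/user/domain pattern of the endpoints shown later in this guide), and 5335 is the default value of hadoop.registry.dns.bind-port:
+
+```shell
+nslookup -port=5335 master-0.standalone-tf.hadoop.example.com ${registry_dns_host}
+```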
+
+### Run a Submarine job
+
+Set submarine.runtime.class to YarnServiceRuntimeFactory in submarine-site.xml:
+```
+<property>
+  <name>submarine.runtime.class</name>
+  <value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
+  <description>RuntimeFactory for Submarine jobs</description>
+</property>
+```
+
+#### Standalone Mode
+
+Suppose we want to submit a TensorFlow job named standalone-tf. First, destroy any application with the same name and clean up historical job directories:
+
+```bash
+./bin/yarn app -destroy standalone-tf
+./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
+```
+where ${dfs_name_service} is the HDFS name service you use.
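+
+For example, if hdfs-site.xml defines dfs.nameservices=mycluster (an illustrative value, not a requirement), the cleanup command becomes:
+
+```shell
+./bin/hdfs dfs -rmr hdfs://mycluster/tmp/cifar-10-jobdir
+```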
+
+Run a standalone TensorFlow job:
+
+```
+CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:\
+${SUBMARINE_HOME}/conf: \
+java org.apache.submarine.client.cli.Cli job run \
+ --env DOCKER_JAVA_HOME=/opt/java \
+ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name standalone-tf \
+ --docker_image dockerfile-cpu-tf1.8.0-with-models \
+ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
+ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \
+ --worker_resources memory=4G,vcores=2 --verbose \
+ --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0"
+```
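+
+After submission, you can follow the job with the standard YARN CLI; this is a sketch, with the app name matching the --name flag above and <application_id> taken from the submission output:
+
+```shell
+./bin/yarn app -status standalone-tf
+./bin/yarn logs -applicationId <application_id>
+```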
+
+#### Distributed Mode
+
+Clean up any apps with the same name:
+
+```bash
+./bin/yarn app -destroy distributed-tf
+./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
+```
+
+Run a distributed TensorFlow job:
+
+```
+CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:\
+${SUBMARINE_HOME}/conf: \
+java org.apache.submarine.client.cli.Cli job run \
+ --env DOCKER_JAVA_HOME=/opt/java \
+ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf \
+ --docker_image dockerfile-cpu-tf1.8.0-with-models \
+ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
+ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
+ --worker_resources memory=4G,vcores=2 --verbose \
+ --num_ps 1 \
+ --ps_resources memory=4G,vcores=2 \
+ --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
+ --num_workers 4 \
+ --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
+```
diff --git a/dev-support/submarine-installer/InstallationGuideChineseVersion.md b/dev-support/submarine-installer/InstallationGuideChineseVersion.md
index f81acbe..5339a74 100644
--- a/dev-support/submarine-installer/InstallationGuideChineseVersion.md
+++ b/dev-support/submarine-installer/InstallationGuideChineseVersion.md
@@ -72,7 +72,7 @@ sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
 sudo /usr/bin/nvidia-uninstall
 ```
 
-安装nvidia-detect,用于检查显卡版本
+安装 nvidia-detect,用于检查显卡版本
 
 ```
 yum install nvidia-detect
@@ -85,7 +85,7 @@ This device requires the current 390.87 NVIDIA driver kmod-nvidia
 An Intel display controller was also detected
 ```
 
-注意这里的信息 [Quadro K620] 和390.87。
+注意这里的信息 [Quadro K620] 和 390.87。
 下载 [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html)
 
 
@@ -266,10 +266,10 @@ https://github.com/NVIDIA/nvidia-docker/tree/1.0
 
 ### Tensorflow Image
 
-CUDNN 和 CUDA 其实不需要在物理机上安装,因为 Sumbmarine 中提供了已经包含了CUDNN 和 CUDA 的镜像文件,基础的Dockfile可参见WriteDockerfile.md
+CUDNN 和 CUDA 其实不需要在物理机上安装,因为 Submarine 中提供了已经包含了 CUDNN 和 CUDA 的镜像文件,基础的 Dockerfile 可参见 WriteDockerfile.md
 
 
-上述images无法支持kerberos环境,如果需要kerberos可以使用如下Dockfile
+上述 images 无法支持 kerberos 环境,如果需要 kerberos 可以使用如下 Dockerfile
 
 ```shell
 FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
@@ -368,83 +368,16 @@ $ python >> tf.__version__
    ls -l /usr/local/nvidia/lib64 | grep libcuda.so
    ```
 
-### 安装 Etcd
-
-运行 Submarine/install.sh 脚本,就可以在指定服务器中安装 Etcd 组件和服务自启动脚本。
-
-```shell
-$ ./Submarine/install.sh
-# 通过如下命令查看 Etcd 服务状态
-systemctl status Etcd.service
-```
-
-检查 Etcd 服务状态
-
-```shell
-$ etcdctl cluster-health
-member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
-member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
-member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
-cluster is healthy
-
-$ etcdctl member list
-3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
-85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
-b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
-```
-其中,${etcd_host_ip*} 是etcd服务器的ip
-
-
-### 安装 Calico
-
-运行 Submarine/install.sh 脚本,就可以在指定服务器中安装 Calico 组件和服务自启动脚本。
-
-```
-systemctl start calico-node.service
-systemctl status calico-node.service
-```
-
-#### 检查 Calico 网络
-
-```shell
-# 执行如下命令,注意:不会显示本服务器的状态,只显示其他的服务器状态
-$ calicoctl node status
-Calico process is running.
-
-IPv4 BGP status
-+---------------+-------------------+-------+------------+-------------+
-| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
-+---------------+-------------------+-------+------------+-------------+
-| ${host_ip1} | node-to-node mesh | up    | 2018-09-21 | Established |
-| ${host_ip2} | node-to-node mesh | up    | 2018-09-21 | Established |
-| ${host_ip3} | node-to-node mesh | up    | 2018-09-21 | Established |
-+---------------+-------------------+-------+------------+-------------+
-
-IPv6 BGP status
-No IPv6 peers found.
-```
-
-创建docker container,验证calico网络
-
-```
-docker network create --driver calico --ipam-driver calico-ipam calico-network
-docker run --net calico-network --name workload-A -tid busybox
-docker run --net calico-network --name workload-B -tid busybox
-docker exec workload-A ping workload-B
-```
-
-
 ## 安装 Hadoop
 
-### 编译 Hadoop
-
-```
-mvn package -Pdist -DskipTests -Dtar
-```
+### 获取并安装 Hadoop
+首先,我们通过源码编译或者直接从官网 [Hadoop Homepage](https://hadoop.apache.org/) 下载获取 hadoop 包。
+然后,请参考 [Hadoop Cluster Setup](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html)
+进行 Hadoop 集群安装。
 
 
 
-### 启动 YARN服务
+### 启动 YARN 服务
 
 ```
 YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
@@ -453,13 +386,6 @@ YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
 YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
 ```
 
-### 启动 registery dns 服务
-
-```
-sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
-```
-
-
 
 ### 测试 wordcount
 
@@ -614,6 +540,182 @@ Distributed-shell + GPU + cgroup
  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
 ```
 
+## Yarn Service Runtime (不推荐)
+
+hadoop 3.1.0 提供了 yarn native service 功能,Submarine 可以利用 yarn native service 提交分布式机器学习任务。
+但是,由于使用 yarn native service 会引入一些额外的组件,导致部署和运维比较困难,因而在 Submarine 0.3.0 之后 Yarn Service Runtime 不再推荐使用。我们建议直接使用
+YarnRuntime,这样可以在 yarn 2.9 上提交机器学习任务。
+如需开启 Yarn Service Runtime,可以参照下面的方法:
+
+### 安装 Etcd
+
+运行 Submarine/install.sh 脚本,就可以在指定服务器中安装 Etcd 组件和服务自启动脚本。
+
+```shell
+$ ./Submarine/install.sh
+# 通过如下命令查看 Etcd 服务状态
+systemctl status Etcd.service
+```
+
+检查 Etcd 服务状态
+
+```shell
+$ etcdctl cluster-health
+member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
+member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
+member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
+cluster is healthy
+
+$ etcdctl member list
+3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
+85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
+b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
+```
+其中,${etcd_host_ip*} 是 etcd 服务器的 IP。
+
+
+### 安装 Calico
+
+运行 Submarine/install.sh 脚本,就可以在指定服务器中安装 Calico 组件和服务自启动脚本。
+
+```
+systemctl start calico-node.service
+systemctl status calico-node.service
+```
+
+#### 检查 Calico 网络
+
+```shell
+# 执行如下命令,注意:不会显示本服务器的状态,只显示其他的服务器状态
+$ calicoctl node status
+Calico process is running.
+
+IPv4 BGP status
++---------------+-------------------+-------+------------+-------------+
+| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
++---------------+-------------------+-------+------------+-------------+
+| ${host_ip1} | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip2} | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip3} | node-to-node mesh | up    | 2018-09-21 | Established |
++---------------+-------------------+-------+------------+-------------+
+
+IPv6 BGP status
+No IPv6 peers found.
+```
+
+创建 docker container,验证 calico 网络
+
+```
+docker network create --driver calico --ipam-driver calico-ipam calico-network
+docker run --net calico-network --name workload-A -tid busybox
+docker run --net calico-network --name workload-B -tid busybox
+docker exec workload-A ping workload-B
+```
+
+### 为 Yarn Docker container 开启 Calico 网络
+在配置文件 yarn-site.xml,为 docker container 设置 Calico 网络。
+
+```
+<property>
+  <name>yarn.nodemanager.runtime.linux.docker.default-container-network</name>
+  <value>calico-network</value>
+</property>
+<property>
+  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
+  <value>default,docker</value>
+</property>
+<property>
+  <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
+  <value>host,none,bridge,calico-network</value>
+</property>
+```
+
+在配置文件 container-executor.cfg 中,添加 calico-network 网络
+
+```
+docker.allowed.networks=bridge,host,none,calico-network
+```
+
+重启所有的 nodemanager 节点。
+
+
+### 启动 registry dns 服务
+
+Yarn registry dns server 是为服务发现功能而实现的 DNS 服务。yarn docker container 通过向 registry dns server 注册,对外暴露 container 域名与 container IP/port 的映射关系。
+
+Yarn registry dns 的详细配置信息和部署方式,可以参考 [Registry DNS Server](http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/RegistryDNS.html)
+
+启动 registry dns 命令:
+```
+sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
+```
+
+### 运行 submarine 任务
+
+在配置文件 submarine-site.xml 中设置 submarine.runtime.class
+```
+<property>
+  <name>submarine.runtime.class</name>
+  <value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
+  <description>RuntimeFactory for Submarine jobs</description>
+</property>
+```
+
+#### 单机模式
+
+清理重名任务
+
+```bash
+./bin/yarn app -destroy standalone-tf
+./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
+```
+其中,变量 ${dfs_name_service} 请根据环境,用你的 HDFS name service 名称替换。
+
+执行单机模式的 tensorflow 任务
+
+```
+CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:\
+${SUBMARINE_HOME}/conf: \
+java org.apache.submarine.client.cli.Cli job run \
+ --env DOCKER_JAVA_HOME=/opt/java \
+ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name standalone-tf \
+ --docker_image dockerfile-cpu-tf1.8.0-with-models \
+ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
+ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \
+ --worker_resources memory=4G,vcores=2 --verbose \
+ --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0"
+```
+
+#### 分布式模式
+
+清理重名任务
+
+```bash
+./bin/yarn app -destroy distributed-tf
+./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
+```
+
+提交分布式模式 tensorflow 任务
+
+```
+CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath --glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:\
+${SUBMARINE_HOME}/conf: \
+java org.apache.submarine.client.cli.Cli job run \
+ --env DOCKER_JAVA_HOME=/opt/java \
+ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf \
+ --docker_image dockerfile-cpu-tf1.8.0-with-models \
+ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
+ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
+ --worker_resources memory=4G,vcores=2 --verbose \
+ --num_ps 1 \
+ --ps_resources memory=4G,vcores=2 \
+ --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
+ --num_workers 4 \
+ --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
+```
 
 
 ## 问题
@@ -643,7 +745,7 @@ chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
 chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
 ```
 
-在支持gpu时,还需cgroup devices路径权限
+在支持 gpu 时,还需 cgroup devices 路径权限
 
 ```
 chown :yarn -R /sys/fs/cgroup/devices
@@ -720,7 +822,7 @@ $ kill -9 5007
 ```
 
 
-### 问题五:命令sudo nvidia-docker run 报错
+### 问题五:命令 sudo nvidia-docker run 报错
 
 ```
 docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
diff --git a/submarine-commons/commons-runtime/src/test/java/org/apache/submarine/commons/runtime/fs/MockRemoteDirectoryManager.java b/submarine-commons/commons-runtime/src/test/java/org/apache/submarine/commons/runtime/fs/MockRemoteDirectoryManager.java
index 71331d5..f3ea39b 100644
--- a/submarine-commons/commons-runtime/src/test/java/org/apache/submarine/commons/runtime/fs/MockRemoteDirectoryManager.java
+++ b/submarine-commons/commons-runtime/src/test/java/org/apache/submarine/commons/runtime/fs/MockRemoteDirectoryManager.java
@@ -80,7 +80,7 @@ public class MockRemoteDirectoryManager implements RemoteDirectoryManager {
   private File initializeModelParentDir() throws IOException {
     File dir = new File(
         "target/_models_" + System.currentTimeMillis());
-    if (!dir.mkdirs()) {
+    if (!dir.exists() && !dir.mkdirs()) {
       throw new IOException(
           String.format(FAILED_TO_CREATE_DIRS_FORMAT_STRING,
               dir.getAbsolutePath()));
diff --git a/submarine-server/server-submitter/submitter-yarnservice/src/main/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGenerator.java b/submarine-server/server-submitter/submitter-yarnservice/src/main/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGenerator.java
index 45de9e7..2394eea 100644
--- a/submarine-server/server-submitter/submitter-yarnservice/src/main/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGenerator.java
+++ b/submarine-server/server-submitter/submitter-yarnservice/src/main/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGenerator.java
@@ -19,32 +19,11 @@
 
 package org.apache.submarine.server.submitter.yarnservice.tensorflow;
 
-import com.fasterxml.jackson.annotation.JsonInclude;
-import com.fasterxml.jackson.core.JsonProcessingException;
-import com.fasterxml.jackson.databind.DeserializationFeature;
-import com.fasterxml.jackson.databind.ObjectMapper;
-import com.fasterxml.jackson.databind.SerializationFeature;
-import com.fasterxml.jackson.databind.node.ArrayNode;
-import com.fasterxml.jackson.databind.node.ObjectNode;
-import com.fasterxml.jackson.databind.type.TypeFactory;
-import com.fasterxml.jackson.module.jaxb.JaxbAnnotationIntrospector;
 import org.apache.submarine.commons.runtime.conf.Envs;
 import org.apache.submarine.server.submitter.yarnservice.YarnServiceUtils;
 
 public class TensorFlowConfigEnvGenerator {
 
-  private static final ObjectMapper OBJECT_MAPPER = createObjectMapper();
-
-  private static ObjectMapper createObjectMapper() {
-    ObjectMapper mapper = new ObjectMapper();
-    mapper.setAnnotationIntrospector(
-        new JaxbAnnotationIntrospector(TypeFactory.defaultInstance()));
-    mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);
-    mapper.configure(SerializationFeature.FLUSH_AFTER_WRITE_VALUE, false);
-    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
-    return mapper;
-  }
-
   public static String getTFConfigEnv(String componentName, int nWorkers,
       int nPs, String serviceName, String userName, String domain) {
     String commonEndpointSuffix = YarnServiceUtils
@@ -69,34 +48,57 @@ public class TensorFlowConfigEnvGenerator {
       this.endpointSuffix = endpointSuffix;
     }
 
+    // We can't simply return a standard JSON string here: the generated
+    // command, export TF_CONFIG="json", would drop the plain " characters
+    // inside the JSON, so every " has to be emitted pre-escaped as \".
     String toJson() {
-      ObjectNode rootNode = OBJECT_MAPPER.createObjectNode();
+      String json = "{\\\"cluster\\\":{";
 
-      ObjectNode cluster = rootNode.putObject("cluster");
-      createComponentArray(cluster, "master", 1);
-      createComponentArray(cluster, "worker", nWorkers - 1);
-      createComponentArray(cluster, "ps", nPS);
+      String master = getComponentArrayJson("master", 1, endpointSuffix)
+          + ",";
+      String worker = getComponentArrayJson("worker", nWorkers - 1,
+          endpointSuffix) + ",";
+      String ps = getComponentArrayJson("ps", nPS, endpointSuffix) + "},";
 
-      ObjectNode task = rootNode.putObject("task");
-      task.put("type", componentName);
-      task.put("index", "$" + Envs.TASK_INDEX_ENV);
-      task.put("environment", "cloud");
-      try {
-        return OBJECT_MAPPER.writeValueAsString(rootNode);
-      } catch (JsonProcessingException e) {
-        throw new RuntimeException("Failed to serialize TF config env JSON!",
-            e);
-      }
+      StringBuilder sb = new StringBuilder();
+      sb.append("\\\"task\\\":{");
+      sb.append(" \\\"type\\\":\\\"");
+      sb.append(componentName);
+      sb.append("\\\",");
+      sb.append(" \\\"index\\\":");
+      sb.append('$');
+      sb.append(Envs.TASK_INDEX_ENV + "},");
+      String task = sb.toString();
+      String environment = "\\\"environment\\\":\\\"cloud\\\"}";
+
+      sb = new StringBuilder();
+      sb.append(json);
+      sb.append(master);
+      sb.append(worker);
+      sb.append(ps);
+      sb.append(task);
+      sb.append(environment);
+      return sb.toString();
     }
 
-    private void createComponentArray(ObjectNode cluster, String name,
-          int count) {
-      ArrayNode array = cluster.putArray(name);
+    private String getComponentArrayJson(String componentName, int count,
+        String endpointSuffix) {
+      String component = "\\\"" + componentName + "\\\":";
+      StringBuilder array = new StringBuilder();
+      array.append("[");
       for (int i = 0; i < count; i++) {
-        String componentValue = String.format("%s-%d%s", name, i,
-            endpointSuffix);
-        array.add(componentValue);
+        array.append("\\\"");
+        array.append(componentName);
+        array.append("-");
+        array.append(i);
+        array.append(endpointSuffix);
+        array.append("\\\"");
+        if (i != count - 1) {
+          array.append(",");
+        }
       }
+      array.append("]");
+      return component + array.toString();
     }
   }
 }
diff --git a/submarine-server/server-submitter/submitter-yarnservice/src/test/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGeneratorTest.java b/submarine-server/server-submitter/submitter-yarnservice/src/test/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGeneratorTest.java
index de1135b..0e30873 100644
--- a/submarine-server/server-submitter/submitter-yarnservice/src/test/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGeneratorTest.java
+++ b/submarine-server/server-submitter/submitter-yarnservice/src/test/java/org/apache/submarine/server/submitter/yarnservice/tensorflow/TensorFlowConfigEnvGeneratorTest.java
@@ -19,131 +19,53 @@
 
 package org.apache.submarine.server.submitter.yarnservice.tensorflow;
 
-import com.fasterxml.jackson.databind.JsonNode;
-import com.fasterxml.jackson.databind.ObjectMapper;
-import com.fasterxml.jackson.databind.node.ArrayNode;
-import com.fasterxml.jackson.databind.node.JsonNodeType;
-import org.junit.Before;
 import org.junit.Test;
-
-import java.io.IOException;
-
 import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertNotNull;
 
 /**
  * Class to test some functionality of {@link TensorFlowConfigEnvGenerator}.
  */
 public class TensorFlowConfigEnvGeneratorTest {
-  private ObjectMapper objectMapper;
-
-  @Before
-  public void setup() {
-    objectMapper = new ObjectMapper();
-  }
-
-  private void verifyCommonJsonData(JsonNode node, String taskType) {
-    JsonNode task = node.get("task");
-    assertNotNull(task);
-    assertEquals(taskType, task.get("type").asText());
-    assertEquals("$_TASK_INDEX", task.get("index").asText());
-
-    JsonNode environment = task.get("environment");
-    assertNotNull(environment);
-    assertEquals("cloud", environment.asText());
-  }
-
-  private void verifyArrayElements(JsonNode node, String childName,
-        String... elements) {
-    JsonNode master = node.get(childName);
-    assertNotNull(master);
-    assertEquals(JsonNodeType.ARRAY, master.getNodeType());
-    ArrayNode masterArray = (ArrayNode) master;
-    verifyArray(masterArray, elements);
-  }
-
-  private void verifyArray(ArrayNode array, String... elements) {
-    int arraySize = array.size();
-    assertEquals(elements.length, arraySize);
-
-    for (int i = 0; i < arraySize; i++) {
-      JsonNode arrayElement = array.get(i);
-      assertEquals(elements[i], arrayElement.asText());
-    }
-  }
 
   @Test
-  public void testSimpleDistributedTFConfigGeneratorWorker()
-      throws IOException {
+  public void testSimpleDistributedTFConfigGeneratorWorker() {
     String json = TensorFlowConfigEnvGenerator.getTFConfigEnv("worker", 5, 3,
             "wtan", "tf-job-001", "example.com");
-
-    JsonNode jsonNode = objectMapper.readTree(json);
-    assertNotNull(jsonNode);
-    JsonNode cluster = jsonNode.get("cluster");
-    assertNotNull(cluster);
-
-    verifyArrayElements(cluster, "master",
-        "master-0.wtan.tf-job-001.example.com:8000");
-    verifyArrayElements(cluster, "worker",
-        "worker-0.wtan.tf-job-001.example.com:8000",
-        "worker-1.wtan.tf-job-001.example.com:8000",
-        "worker-2.wtan.tf-job-001.example.com:8000",
-        "worker-3.wtan.tf-job-001.example.com:8000");
-
-    verifyArrayElements(cluster, "ps",
-        "ps-0.wtan.tf-job-001.example.com:8000",
-        "ps-1.wtan.tf-job-001.example.com:8000",
-        "ps-2.wtan.tf-job-001.example.com:8000");
-
-    verifyCommonJsonData(jsonNode, "worker");
+    assertEquals("{\\\"cluster\\\":{\\\"master\\\":[\\\"master-0.wtan" +
+        ".tf-job-001.example.com:8000\\\"],\\\"worker\\\":[\\\"worker-0.wtan" +
+        ".tf-job-001.example.com:8000\\\",\\\"worker-1.wtan.tf-job-001" +
+        ".example.com:8000\\\",\\\"worker-2.wtan.tf-job-001.example" +
+        ".com:8000\\\",\\\"worker-3.wtan.tf-job-001.example.com:8000\\\"]," +
+        "\\\"ps\\\":[\\\"ps-0.wtan.tf-job-001.example.com:8000\\\",\\\"ps-1" +
+        ".wtan.tf-job-001.example.com:8000\\\",\\\"ps-2.wtan.tf-job-001" +
+        ".example.com:8000\\\"]},\\\"task\\\":{ \\\"type\\\":\\\"worker\\\", " +
+        "\\\"index\\\":$_TASK_INDEX},\\\"environment\\\":\\\"cloud\\\"}",
+        json);
   }
 
   @Test
-  public void testSimpleDistributedTFConfigGeneratorMaster()
-      throws IOException {
+  public void testSimpleDistributedTFConfigGeneratorMaster() {
     String json = TensorFlowConfigEnvGenerator.getTFConfigEnv("master", 2, 1,
         "wtan", "tf-job-001", "example.com");
-
-    JsonNode jsonNode = objectMapper.readTree(json);
-    assertNotNull(jsonNode);
-    JsonNode cluster = jsonNode.get("cluster");
-    assertNotNull(cluster);
-
-    verifyArrayElements(cluster, "master",
-        "master-0.wtan.tf-job-001.example.com:8000");
-    verifyArrayElements(cluster, "worker",
-        "worker-0.wtan.tf-job-001.example.com:8000");
-
-    verifyArrayElements(cluster, "ps",
-        "ps-0.wtan.tf-job-001.example.com:8000");
-
-    verifyCommonJsonData(jsonNode, "master");
+    assertEquals("{\\\"cluster\\\":{\\\"master\\\":[\\\"master-0.wtan" +
+        ".tf-job-001.example.com:8000\\\"],\\\"worker\\\":[\\\"worker-0.wtan" +
+        ".tf-job-001.example.com:8000\\\"],\\\"ps\\\":[\\\"ps-0.wtan" +
+        ".tf-job-001.example.com:8000\\\"]},\\\"task\\\":{ " +
+        "\\\"type\\\":\\\"master\\\", \\\"index\\\":$_TASK_INDEX}," +
+        "\\\"environment\\\":\\\"cloud\\\"}", json);
   }
 
   @Test
-  public void testSimpleDistributedTFConfigGeneratorPS() throws IOException {
+  public void testSimpleDistributedTFConfigGeneratorPS() {
     String json = TensorFlowConfigEnvGenerator.getTFConfigEnv("ps", 5, 3,
         "wtan", "tf-job-001", "example.com");
-
-    JsonNode jsonNode = objectMapper.readTree(json);
-    assertNotNull(jsonNode);
-    JsonNode cluster = jsonNode.get("cluster");
-    assertNotNull(cluster);
-
-    verifyArrayElements(cluster, "master",
-        "master-0.wtan.tf-job-001.example.com:8000");
-    verifyArrayElements(cluster, "worker",
-        "worker-0.wtan.tf-job-001.example.com:8000",
-        "worker-1.wtan.tf-job-001.example.com:8000",
-        "worker-2.wtan.tf-job-001.example.com:8000",
-        "worker-3.wtan.tf-job-001.example.com:8000");
-
-    verifyArrayElements(cluster, "ps",
-        "ps-0.wtan.tf-job-001.example.com:8000",
-        "ps-1.wtan.tf-job-001.example.com:8000",
-        "ps-2.wtan.tf-job-001.example.com:8000");
-
-    verifyCommonJsonData(jsonNode, "ps");
+    assertEquals("{\\\"cluster\\\":{\\\"master\\\":[\\\"master-0.wtan" +
+        ".tf-job-001.example.com:8000\\\"],\\\"worker\\\":[\\\"worker-0.wtan" +
+        ".tf-job-001.example.com:8000\\\",\\\"worker-1.wtan.tf-job-001" +
+        ".example.com:8000\\\",\\\"worker-2.wtan.tf-job-001.example" +
+        ".com:8000\\\",\\\"worker-3.wtan.tf-job-001.example.com:8000\\\"]," +
+        "\\\"ps\\\":[\\\"ps-0.wtan.tf-job-001.example.com:8000\\\",\\\"ps-1" +
+        ".wtan.tf-job-001.example.com:8000\\\",\\\"ps-2.wtan.tf-job-001" +
+        ".example.com:8000\\\"]},\\\"task\\\":{ \\\"type\\\":\\\"ps\\\", " +
+        "\\\"index\\\":$_TASK_INDEX},\\\"environment\\\":\\\"cloud\\\"}",
+        json);
   }
 }


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@submarine.apache.org
For additional commands, e-mail: dev-help@submarine.apache.org